
Journal Pre-proof

Real-time Emulation and Analysis of Multiple NAND Flash Channels in Solid-state Storage Devices
Nikolaos Toulgaridis, Eleni Bougioukou, Maria Varsamou, Theodore Antonakopoulos

PII: S0141-9331(19)30169-3
DOI: https://doi.org/10.1016/j.micpro.2019.102986
Reference: MICPRO 102986
To appear in: Microprocessors and Microsystems
Received date: 21 March 2019
Revised date: 1 September 2019
Accepted date: 30 December 2019

Please cite this article as: Nikolaos Toulgaridis, Eleni Bougioukou, Maria Varsamou, Theodore Antonakopoulos, Real-time Emulation and Analysis of Multiple NAND Flash Channels in Solid-state Storage Devices, Microprocessors and Microsystems (2019), doi: https://doi.org/10.1016/j.micpro.2019.102986

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Real-time Emulation and Analysis of Multiple NAND Flash Channels in Solid-state Storage Devices✩

Nikolaos Toulgaridis, Eleni Bougioukou, Maria Varsamou, Theodore Antonakopoulos
Department of Electrical and Computer Engineering, University of Patras, Patras, 26504, Greece

Abstract

NAND Flash is the most prevalent memory technology used today in data storage systems, covering a wide range of applications, from consumer devices to high-end enterprise systems. In this work, we present a modular and versatile FPGA-based platform that achieves accurate emulation of multiple NAND Flash channels. The NAND Flash emulator is based on an expandable and reconfigurable architecture that can be used for developing and testing new NAND Flash controllers and for analysing the behaviour of existing NAND Flash controllers and/or host device drivers. The presented NAND Flash emulator is based on PCIe-based FPGA boards attached to a high-end server, supports standard memory interfaces, responds to all memory commands in proper time and has the capability to emulate memory space in the range of a few TBs. The NAND Flash emulator has been prototyped and tested, and experimental results demonstrate that all timing requirements are satisfied under maximum read/write workloads. The NAND Flash emulator also includes a hardware tracer unit that records information on all commands exchanged at the NAND Flash interfaces, along with high-resolution timestamps. The recorded information can be used to analyse higher level functions, like wear-leveling and garbage collection, and can be combined with other software tools for analysing cognitive functions. Experimental results demonstrate the advantage of using this emulator for analysing how host device drivers implement wear-leveling and garbage collection functions.

Keywords: NAND Flash, FPGA emulator, wear-leveling, garbage collection, device drivers.

✩ This manuscript is an extended version of the paper: Real-time Emulation of Multiple NAND Flash Channels by Exploiting the DRAM Memory of High-end Servers, The Euromicro Conference on Digital System Design (DSD-2018), August 29-31, 2018, Prague, Czech Republic.

Email addresses: [email protected] (Nikolaos Toulgaridis), [email protected] (Eleni Bougioukou), [email protected] (Maria Varsamou), [email protected] (Theodore Antonakopoulos)

1. Introduction

NAND Flash devices are used in all types of storage systems, i.e. USB sticks/drives, mobile phones and Solid-State Drives (SSDs). USB drives are used in various commercial applications and in many cases as low-cost secondary storage devices, for example in automotive applications. NAND Flash devices were used extensively in mobile phones as their main storage unit. In the last few years, new interface technologies, like eMMC and UFS, have been developed in this technological area. NAND Flash-based solid-state drives (SSDs) have emerged as a low-cost, high-performance and reliable storage medium for commercial and enterprise storage needs. In all these storage systems there is a main NAND Flash controller and a number of NAND Flash channels.


This controller communicates with the host machine using an interface like USB, SATA, PCIe, etc. and implements various NAND Flash related functions, such as wear-leveling, garbage collection, etc. [1]. For achieving high data rates during read/write operations, a storage device uses a number of independently operating NAND Flash channels. Each NAND Flash channel contains a number of interconnected memory dies that share the same data bus and some control signals. The maximum transfer rate per channel is achieved by pipelining commands to different memory dies [2].

Various non-volatile memory technologies are used in storage devices [3]. These technologies include NAND Flash (SLC, MLC, TLC, 3D), PCM etc. [4], [5], [6]. Depending on the application area where the memory is used, different interfaces have been developed, like ONFI, Toggle, eMMC, UFS, LPDDR etc. [7], [8], [9], in order to satisfy the user requirements in terms of speed, power consumption and I/O availability.

At the initial phases of the design of a storage device, the developers mainly use simulators for studying the effect of various design parameters. Depending on the required accuracy and the complexity of these simulators, various memory models have been developed. Since simulators cannot satisfy the real-time specifications of a real storage device, at an advanced design phase an accurate emulator has to be used, especially during the development and debugging phases of the microcode of the storage controller. The use of an emulator is mandatory when the memory chip is still in the development phase, but it is also indispensable when tests have to be performed starting at a given ageing state of the memory device. Worldwide, many NAND Flash-based emulators have been presented, either for implementing detailed timing models [10] or for analyzing the SSD's internal behavior [11], [12]. There are also some simulators that facilitate the development of flash memory management software, such as the flash translation layer [13].

In this work, we present a hardware non-volatile memory emulator (presently its NAND Flash version) that can be used for developing and testing new NAND Flash controllers with multiple channels, even before the actual memory ICs are available. The proposed emulator can also be used to emulate the ageing effect of NAND Flash memories and how it affects the performance and the reliability of a storage device. One of the main problems that has to be confronted in such an emulator is the huge memory capacity that has to be supported, while at the same time satisfying the response time of a few tens of usecs of a memory chip during a data read process. Nowadays, NAND Flash ICs with capacities of a few tens or even hundreds of GBytes are commercially available, resulting in a large memory space per NAND Flash channel. Considering the large number of NAND Flash channels used per USB/SSD, the total memory capacity is in the range of a few hundreds of GBytes and in many cases a few TBytes. Such a memory capacity with fast access time is available in high-end computing servers, and accessing it in proper time can be achieved only if the server's operating system is not involved during the memory access. The architecture proposed in this work confronts that issue efficiently.

Measuring the performance and analysing the functionality of NAND Flash controllers is another application area where a NAND Flash emulator can be used.
In this case, the NAND Flash emulator must be initialized with the content of the NAND Flash ICs in order to demonstrate exactly the same behaviour as the real device. In both cases, a programmable emulator is needed that supports standard memory interfaces, responds in proper time to all commands, and has the capability to emulate the whole memory area (in the range of a few hundreds of GBs per channel). In this work, we present a new approach which is based on single or multiple PCIe-based FPGA boards attached to the PCIe slots of a high-end server and uses the server's DRAM as a huge shared memory that each emulated channel can access independently. The NAND Flash emulator has been prototyped and tested, and experimental results demonstrate that all timing requirements are satisfied under demanding read/write workloads.

The remainder of this paper is organized as follows. Section 2 analyzes the current trends in NAND Flash technologies and devices that specify the functionality of the presented emulator. Section 3 gives a concise description of the interconnect technology used in servers and how a PCIe-based board can access the host's DRAM.

Section 4 describes the proposed architecture, emphasizing the hardware units developed for achieving proper execution of read, write and erase commands per NAND Flash channel. This section also describes how the NAND Flash content is allocated to the host's DRAM. Section 5 demonstrates experimental results of the NAND Flash emulator and illustrates the advantages of the proposed emulator. Finally, Section 6 presents how the information collected using the tracer of the presented NAND Flash emulator can be used for analysing the wear-leveling and garbage collection functions implemented in well-known device drivers.

2. NAND Flash Technology and Devices

NAND Flash memories are based on Flash cells, i.e. cells of floating gate transistors (FGTs). The effective threshold voltage of an FGT, and respectively its I-V characteristic, depends on the charge stored in its floating gate [14]. NAND Flash acquires non-volatile properties as the floating gate is surrounded by dielectrics which ensure the reliable isolation of the trapped charge for long periods of time [1]. Groups of FGTs sharing the same bit-line are connected in series, forming strings, while logical NAND Flash pages are formed by cells sharing the same word-line. All strings of cells sharing the same group of word-lines form a NAND Flash block.

Fig.1 illustrates the block diagram of a NAND Flash chip, which consists of a memory cell array, page buffers, program and read circuits and an I/O interface. Internally, the memory array can be organized in independent planes and multiple data buffers, providing the capability to support caching and more advanced multi-plane commands. The interface between the NAND Flash chip and its controller consists of three sets of I/O signals: a bidirectional data bus, a set of control signals that are always inputs to the memory chip, and some clock and data strobe signals, depending on the mode of operation. As shown in Table 1, there are two main NAND Flash interfaces, ONFi and Toggle [7], [15], with multiple variations. Different NAND Flash interfaces are related to different voltages, transfer rates and types of data transfer (SDR and DDR [16]).

In a NAND Flash device, the Digital I/O and Control Logic decodes the commands sent by the controller through the NAND Flash Interface and activates the respective internal FSM [17]. When a Page Program command is executed, data are first stored in the internal page buffer (I/O DATA Buffer) and then are written to the cells of the NAND Flash array using the programming circuits. Respectively, when a Page Read command is executed, data are read from the cell array with the use of the read/sense circuits and are stored to the page buffer, by exploiting some hard-decision circuits. The Address Decoder determines the column and row address of the command to be executed and activates the program and read circuits to perform operations on the specified cells. When a Block Erase command is executed, a set of pages is programmed to the erase state. There is a set of additional commands either for setting up the memory chip (i.e. the Set Features command) or for speeding up the information access (i.e. change column, cache, copy-back).
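As an illustration of how such a command stream looks at the interface level, the minimal sketch below decodes the first command cycle of the baseline ONFI operations mentioned above (Page Read 00h/30h, Page Program 80h/10h, Block Erase 60h/D0h, Read Status 70h, Read ID 90h, Set Features EFh). It is only an illustrative decoder, not the actual control logic of the described chip, and the `flash_op_t` names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical operation labels for the baseline ONFI command set. */
typedef enum {
    OP_PAGE_READ,    /* 00h ... address ... 30h        */
    OP_PAGE_PROGRAM, /* 80h ... address, data ... 10h  */
    OP_BLOCK_ERASE,  /* 60h ... row address ... D0h    */
    OP_READ_STATUS,  /* 70h                            */
    OP_READ_ID,      /* 90h                            */
    OP_SET_FEATURES, /* EFh                            */
    OP_UNKNOWN
} flash_op_t;

/* Decode the first command cycle latched while CLE is asserted. */
static flash_op_t decode_first_cycle(uint8_t cmd)
{
    switch (cmd) {
    case 0x00: return OP_PAGE_READ;
    case 0x80: return OP_PAGE_PROGRAM;
    case 0x60: return OP_BLOCK_ERASE;
    case 0x70: return OP_READ_STATUS;
    case 0x90: return OP_READ_ID;
    case 0xEF: return OP_SET_FEATURES;
    default:   return OP_UNKNOWN;
    }
}

int main(void)
{
    const uint8_t first_cycles[] = { 0x00, 0x80, 0x60, 0x70, 0x90, 0xEF };
    for (unsigned i = 0; i < sizeof(first_cycles); i++)
        printf("cmd %02Xh -> op %d\n", first_cycles[i], decode_first_cycle(first_cycles[i]));
    return 0;
}
```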
A NAND Flash channel consists of a number of NAND Flash devices that share the same data bus and control signals, while some separate signals are used by the NAND Flash controller either for selecting a specific NAND Flash die or for sensing its status. Increasing the number of dies that share the same bus may increase the pipeline depth, and consequently the utilization of the NAND Flash channel, but it may also result in a slower transfer rate due to the impedances introduced on the data bus lines. For increasing the data rate of the whole NAND Flash storage system, multiple independent channels are used which operate in parallel. This gives more flexibility to the NAND Flash controller and increases the storage capacity, but it also adds more computational burden.

In order to design a NAND Flash emulator that can emulate such a storage device, a number of technical problems have to be resolved. Mechanical access to the sockets/footprints of all NAND Flash devices is a technical issue not addressed in this work. The other two main issues are addressed in the following sections: how each individual channel is supported, and how the total memory is emulated and accessed so as to satisfy the NAND Flash interface and manufacturer specifications.

3. Memory of the NAND Flash Emulator and the use of DRAM in Xeon Servers

Although the presented NAND Flash emulator is independent of the host processor used and the only requirement (that is usually met) is to support the PCIe interface, in this section we base our description on the widely used Xeon processors. These processors use a number of cores connected to shared Last-Level Cache (LLC) modules by a high-bandwidth interconnect (QuickPath Interconnect) [18]. The cores and shared LLCs are interconnected via one or more caching agents to the rest of the system. For accessing the system's DRAM, the Xeon has one or more memory controllers, where each controller covers a unique portion of the total memory address space. Each memory controller is connected to the system by a home agent. The architecture of the Intel Xeon processor E5 v4 MCC used in our emulation system is shown in Figure 2. The red rings connect the L2 caches to portions of the L3 cache, as well as to the QuickPath Interconnect links, PCIe links, and to the home agents of the memory controllers.

A memory access that starts at a core and misses the L1 and L2 caches travels along these buses to the home agent of the target memory controller for that specific physical address [19]. If the destination memory controller is on a different socket in a multiprocessor system, the traffic goes over the QPI link to a ring bus on the target processor. There are multiple memory controllers near the bus, each controlling multiple channels. Each home agent recognizes the physical addresses targeting its channels, translates a physical address into a channel address and passes it to the memory controller. Each memory controller is responsible for translating the physical address into a memory address (row, column, bank, stacked rank, DIMM). The exact mapping from system addresses to the actual physical structure of DRAM is sparsely documented, so we had to rely on other works [20] along with some of Intel's datasheets [21], [22] in order to understand the address decoding architecture.

Our server is equipped with an Intel Xeon E5-2650 v4 processor (Broadwell) that has 4 memory channels and 8 DRAM DIMMs. This processor has 12 cores, so it belongs to the second die configuration of Broadwell, named MCC (Medium Core Count). These dies have two memory controllers, so each controller handles two memory channels. By using the Intel PCM (Performance Counter Monitor) tool, which monitors hardware performance counters on Intel processors, we determined that memory controller 0 handles memory channels 0 and 2, while memory controller 1 handles memory channels 1 and 3.

The address decoding from physical addresses to physical locations in DRAM in Intel Xeon processors depends mainly on two components, the source address decoder (SAD) and the target address decoder (TAD). SAD is responsible for routing a memory request when a miss of the last-level cache occurs or in the case of an uncacheable memory request.
SAD compares the physical address with a set of configured regions. Each region corresponds to a Non-Uniform Memory Access (NUMA) node, so SAD determines to which NUMA node the request is routed. Our processor has only one NUMA node, so SAD routes all requests to the same node. Each NUMA node has two components for routing the requests to DRAM: the home agent handles the cache coherency protocol and the memory controller interfaces with the memory channels. Each NUMA node has its own TAD, which appears to be split between the home agent and the memory controller. TAD matches the physical addresses with its own set of regions. Each region defines the target memory channel for the physical addresses, or a set of target channels for interleaving. In our system, we have disabled channel interleaving through the BIOS, since in the NAND Flash emulator architecture latency is the most critical parameter, not the DRAM access rate.

The TAD DRAM region registers are contained in the home agent's and the integrated memory controller's TAD configuration spaces. After accessing these registers, we concluded that for a system that uses 8 DIMMs of 16 GBytes each, channel 0 handles the physical addresses from 0 to 2 GB and from 34 to 66 GB, channel 1 the addresses from 2 to 34 GB, channel 2 the addresses from 66 to 98 GB and channel 3 the addresses from 98 to 130 GB. Finally, the addresses of the LLC are mapped through a hash function to DRAM physical locations [23]. Using DIMMs of higher capacity (i.e. 128 GB/DIMM) extends the memory space beyond 1 TB and a similar memory allocation is applied.

Understanding how the server's memory space is organized is crucial in our development, since the content of the emulated NAND Flash memory has to be mapped in the server's DRAM, either for accessing its content when a Page Read is executed or for updating its content when Page Program or Block Erase commands have been issued. It is worth mentioning at this stage that, in order to minimize the disturbances from the host processor and the operating system, the total server's memory has been split in two regions. The first region is used by the operating system as the typical server's memory, while the second region is used exclusively by the NAND Flash emulator. The second region is initialized by the host during system start-up and is accessed by the host only for getting statistics. This allows the use of a fast hardware mechanism to access the emulated memory content, as described in the next section. Depending on the capacity of the emulated storage device, the host DRAM is split so that the whole emulated memory area is supported and the rest of the DRAM is allocated to the operating system (Ubuntu in our case).
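To make the decoded TAD regions concrete, the following minimal sketch maps a physical address to the DRAM channel that serves it, for the 8 x 16 GB, interleaving-disabled configuration reported above. The function name and return convention are purely illustrative, and the "GB" boundaries of the text are treated here as binary gigabytes, which is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)

/* DRAM channel serving a physical address, following the decoded TAD regions
 * of our 8 x 16 GB, interleaving-disabled configuration (see above).
 * Returns -1 for addresses outside the decoded regions. */
static int dram_channel_of(uint64_t paddr)
{
    if (paddr <   2 * GiB) return 0;
    if (paddr <  34 * GiB) return 1;
    if (paddr <  66 * GiB) return 0;
    if (paddr <  98 * GiB) return 2;
    if (paddr < 130 * GiB) return 3;
    return -1;
}

int main(void)
{
    const uint64_t samples[] = { 1 * GiB, 20 * GiB, 50 * GiB, 80 * GiB, 120 * GiB };
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("0x%010llx -> channel %d\n",
               (unsigned long long)samples[i], dram_channel_of(samples[i]));
    return 0;
}
```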
4. The Real-Time Multiple-Channels Non-Volatile Memory Emulator

In [17] we presented a NAND Flash emulator for memory devices, focusing mainly on the ageing effect of NAND Flash memories due to program-erase cycles. In this work, we present a Real-time Multiple-channels NAND Flash emulator (RTMC-NFE) that can be used in devices that have a number of NAND Flash channels. Although the emulator was initially designed for NAND Flash devices, it can also be used for other non-volatile memory (NVM) technologies by adjusting mainly its front-end. Therefore, in the first subsection, we use the RTMC-NVM terminology for the general architecture of the presented emulator.

4.1. The RTMC-NVM Architecture

An RTMC-NVM emulator must support a number of parallel and independently operating channels, and it must also use volatile memory, like DRAM, for storing and updating the NVM content. When a command (read or program/erase) is executed at the NVM interface, the content of the respective page must be accessed in due time in order to satisfy the timing requirements of the used NVM interface and the memory manufacturer's specifications. Therefore, a dedicated processing engine has to be used per NVM channel, and the read/program requests must be either multiplexed or routed to dedicated processing units that have access to the volatile memory. Considering a single channel, Fig.3 shows the general architecture of such an emulator.

The RTMC-NVM emulator consists of three main units: the Digital Front-End (DFE) unit, the main Non-Volatile Memory Emulation (NVME) unit and the Host platform where the main DRAM is located. The Digital Front-End (DFE) unit supports either a single or a small number of NVM channels; it is directly attached to the emulated storage device (also referred to as the Device-Under-Test (DUT)) and uses a single FPGA. The DFE communicates with the main NVME unit using dedicated high-speed links, usually optical links at 10 Gbps, and its power is controlled by the main NVME. The DFE is programmed during initialization with the parameters of the emulated NVM and either responds using local information or by communicating with the main NVME.

The main NVME is the central processing unit of the RTMC-NVM emulator. It is implemented on an FPGA-based board attached to the host's motherboard using PCIe. A main NVME may communicate with one or multiple DFEs, and a number of main NVMEs may be used in an emulator. In cases where the total emulated NV memory exceeds the memory capacity of a single server, multiple servers can be used, with a single or multiple main NVMEs per server, as indicated in Fig.4. In this case, the additional latency introduced by the 10 Gbps links can be amortized by implementing caching functions on the DRAM of each main NVME board, or by implementing pipelining functions during data transfers. The motherboard of a server has a number of PCIe slots, where the main NVMEs can be attached, and a number of DRAM DIMMs, organized as described in Section 3. From the whole memory space of a server, a small portion is allocated to the OS (i.e. Ubuntu) and the rest is used exclusively as a dedicated memory space for the emulated NV memory. The proposed architecture is flexible; it can be adjusted according to the needs of the emulated storage device (number of channels, dies per channel, type of NVM, total capacity, I/O voltages, etc.). In the rest of this section, we present details of a NAND Flash emulator that is based on the above-described architecture, while in the next section we present experimental results that demonstrate its performance.

4.2. The RTMC-NFE Architecture

In this section, we present details of the architecture of the RTMC-NVM for NAND Flash memories, the so-called Real-time Multiple-Channels NAND Flash emulator (RTMC-NFE). We consider a configuration that supports two NAND Flash channels, with up to four dies per channel; each die has two independently operating Logical Units (LUNs), and two planes per LUN are used. The whole emulator has been implemented in a single server that supports up to 1 TByte of DRAM. In this development, the previously described units, DFE and main NVME, have been combined in a single FPGA implementation (NFE) and the whole hardware logic has been implemented on a PCIe-based board that uses Gen.3 PCIe with 8 lanes. Fig.5 shows the architecture of the RTMC-NFE that has been implemented on a Xilinx Ultrascale FPGA.

The NFE uses a soft-CPU for initialization and control purposes. During initialization, the host's processor downloads the content of the NAND Flash memory chips to the DRAM allocated for emulation and then updates the emulated system parameters on the FPGA's RAM. As a next step, the soft-CPU initializes the two major units, the Flash Cmds Processing Unit (FCPU) and the DRAM Access Module, described in the next subsection. The FCPU uses a front-end module for implementing the physical NAND Flash interface (SDR and DDR) and processes all received commands. If a command can be supported without accessing the host's DRAM (i.e. Read ID), the FCPU retrieves the proper values from the Parameters RAM and responds immediately using the Local Cmds Processing unit. Otherwise, it passes the respective information to the proper DRAM Access Module, which retrieves/updates the NAND Flash data. Whenever a command is executed, tracing information is generated along with statistics (i.e. response time, elapsed time from the previous command) and a custom debugger/tracer is used to store this information to the local DRAM for further processing.
More information about the tracer is given in subsection 4.4. The soft-CPU communicates with the OS applications by using a dedicated Device Driver [24]. The Device Driver can accept requests from multiple applications at the same time. These requests are queued and dispatched to the RTMC-NFE by using a flexible PCIe-based interface. The Device Driver responds with the relevant completions to the corresponding applications. All requests/completions are exchanged through a list of descriptors stored in a shared memory space on the main memory of the host. Also, for exchanging control information and data, a set of registers in the PCIe address space and an interrupt mechanism are used.

The RTMC-NFE uses two SSDs for storing the content of the emulated NAND Flash devices and the tracing information. The PCIe-attached M.2 NVMe SSD results in fast initialization, a time-consuming operation when a large memory capacity has to be emulated.

4.3. The DRAM Access Module

The DRAM Access Module is the main functional unit for accessing the emulated NAND Flash data stored in the host's DRAM. Its architecture is shown in Fig.6. A single DRAM Access Module is associated with each NAND Flash channel, although the same DRAM Access Module may support multiple dies (CEs) that share the same NAND Flash interface. Access to the host's DRAM is achieved by using PCIe packets, and this allows time multiplexing of requests from a number of simultaneously acting DRAM Access Modules. This supports the basic functionality of a multiple-channels emulator, that is, to emulate a number of channels acting independently in parallel. The DRAM Access Module supports data transactions from the FPGA's internal buffers to the host's DRAM and vice-versa, in order to implement the three basic commands that are related to the NAND Flash content: Page Read, Page Program and Block Erase.

The DRAM Access Module receives commands through the Command FIFO (CMD) and decodes them according to the parameters received by the soft-CPU, a MicroBlaze processor in our case. The data structure contains information like channel, CE, LUN, page or block, and this is translated to physical DRAM addressing information, where the actual data are stored, in order to program the DMA engines. The RWE Processor, embedded in each DRAM Access Module, handles the above information and is responsible for programming the respective DMA engine. There are three DMA engines, one dedicated to each command type (Page Read, Page Program, Block Erase). Moreover, the module contains two FIFOs, one for the data associated with Page Program and one for the Page Read data. For Block Erase, there is a register set to the erased value, so that the corresponding block in the host DRAM is filled with this value once a Block Erase command is executed. All transfers are performed using AMBA-AXI4 [25]. The DMA engines are connected as masters to the main AXI4-Interconnect [26], where all memory devices, including the PCIe IP Core [27], are connected as slaves. The aforementioned PCIe IP Core is responsible for all transactions related to the host DRAM. The RWE Processor monitors the DMA status in every transfer and generates a new response in the Response FIFO at the end of each data transaction.

When a Page Read command is applied, the command's parameters are stored and the programming of the PR DMA engine takes place. Getting the data over PCIe usually takes a few usecs (5 usecs for a 16K page over PCIe Gen.3 with 8 lanes) and then the data are transferred to the data buffer of the respective NAND Flash channel. The total time is much less than the typical Read time, which is usually between 30 and 50 usecs. Then the R/B signal is deasserted and the respective Ready bit is set at the Status Register (for supporting the SR commands). Similar procedures are performed during a Page Program command, while during a Block Erase command no data are transferred at the NAND Flash interface, but the size of the data affected in DRAM is high, since the block size is <pages per block> x <bytes per page>, usually in the range of a few MBytes.
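The sketch below outlines, in plain C, the command-handling loop of the RWE Processor as described above: pop a command from the CMD FIFO, program the DMA engine that matches the command type, and push a response once the DMA completes. The structure fields and function names are hypothetical software stand-ins for the actual FIFO and DMA register interfaces, so the code can be compiled and exercised on its own.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical command descriptor popped from the CMD FIFO. */
typedef enum { CMD_PAGE_READ, CMD_PAGE_PROGRAM, CMD_BLOCK_ERASE } cmd_type_t;

typedef struct {
    cmd_type_t type;
    uint8_t  ce, lun, plane;
    uint32_t block, page;
    uint64_t dram_addr;  /* host DRAM address, from the mapping of Section 4.5 */
    uint32_t length;     /* bytes: one page, or a whole block for Block Erase  */
} flash_cmd_t;

/* Software stand-ins for the CMD FIFO, the three DMA engines and the
 * Response FIFO, so that the loop below is self-contained. */
static flash_cmd_t pending = { CMD_PAGE_READ, 0, 0, 0, 17, 3, 0x12340000ULL, 16384 };
static bool has_pending = true;

static bool cmd_fifo_pop(flash_cmd_t *cmd) { if (!has_pending) return false; *cmd = pending; has_pending = false; return true; }
static void dma_start(cmd_type_t engine, uint64_t addr, uint32_t len) { printf("DMA[%d]: addr=0x%llx len=%u\n", engine, (unsigned long long)addr, len); }
static bool dma_done(cmd_type_t engine) { (void)engine; return true; }
static void response_fifo_push(const flash_cmd_t *cmd, bool ok) { printf("response: cmd=%d ok=%d\n", cmd->type, ok); }

/* Simplified RWE loop: one DMA engine per command type, response on completion. */
static void rwe_process_once(void)
{
    flash_cmd_t cmd;
    if (!cmd_fifo_pop(&cmd))
        return;                      /* nothing pending on this channel */
    dma_start(cmd.type, cmd.dram_addr, cmd.length);
    while (!dma_done(cmd.type))
        ;                            /* the RWE Processor just monitors the DMA status */
    response_fifo_push(&cmd, true);  /* completion goes to the Response FIFO */
}

int main(void) { rwe_process_once(); return 0; }
```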
An important decision at the design level was the selection of the proper data width of the DRAM Access Module's DMAs, in order to achieve the proper data rate, $R_{DMA}$, which is bounded by the required response time and the clock used. Since the RTMC-NFE is based on PCIe Gen.3 with either 8 or 16 lanes, PCIe is not the system's bottleneck, since the useful data rates surpass 6 and 12 GBps respectively. In the case of DDR at 200 MBps, 8-lane PCIe Gen.3 supports up to 30 NAND Flash channels, while for DDR3 at 500 MBps, 16-lane PCIe Gen.3 supports up to 20 fully loaded NAND Flash channels. The DMA data width affects the required FPGA resources for routing, and that also affects the maximum achievable internal clock frequency. For the rest of this section, we use the following nomenclature:

$T_R$, $T_W$ and $T_{BE}$ are the read, write and erase times, and $W_R$, $W_W$ and $W_{BE}$ are the data widths of the respective DMAs. $PpB$ is the number of pages per block and $L$ is the page length (in bytes). $R_{clk}$ is the frequency of the internal clock and $\hat{T}_{R/W}$ is the time required for a DMA to transfer a complete page. Assuming that the overhead required for setting up a DMA is much less than the actual transfer time, the following relations hold:

$$T_{BE} > T_W \gg T_R,$$

$$\hat{T}_W \approx \hat{T}_R = \frac{L}{R_{clk} \, W_{R/W}},$$

$$W_{R/W} > \frac{L}{R_{clk} \, T_{R/W}} \quad \text{and} \quad W_{BE} > \frac{PpB \cdot L}{R_{clk} \, T_{BE}}.$$
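As a quick sanity check of these bounds, the snippet below evaluates the minimum $W_{R/W}$ and $W_{BE}$ for the M#3 device of Table 2 ($L$ = 17,664 B, $PpB$ = 256, $T_R$ = 50 us, $T_{BE}$ = 5 ms). The internal clock frequency of 250 MHz is an assumption made only for this example and is not a figure reported in the paper.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* M#3 parameters from Table 2; R_clk is an assumed example value. */
    const double L     = 17664.0;  /* page length in bytes         */
    const double PpB   = 256.0;    /* pages per block              */
    const double T_R   = 50e-6;    /* read time in seconds         */
    const double T_BE  = 5e-3;     /* block erase time in seconds  */
    const double R_clk = 250e6;    /* assumed internal clock in Hz */

    /* W_{R/W} > L / (R_clk * T_{R/W})  and  W_BE > PpB * L / (R_clk * T_BE) */
    double w_rw_min = L / (R_clk * T_R);
    double w_be_min = (PpB * L) / (R_clk * T_BE);

    printf("minimum W_R/W: %.2f bytes -> use %d bytes\n", w_rw_min, (int)ceil(w_rw_min));
    printf("minimum W_BE : %.2f bytes -> use %d bytes\n", w_be_min, (int)ceil(w_be_min));
    return 0;
}
```

Under this assumed clock, the results (about 1.4 B and 3.6 B) round up to the 2-byte and 4-byte widths listed for M#3 in Table 2.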

Table 2 shows some indicative values for NAND Flash devices and the minimum data widths for supporting the respective response times. Each DRAM Access Module also supports more complicated commands, like copy-back, caching and multi-plane, by exploiting the functionality of the three basic commands described above. For that purpose, double data buffers are used internally and dedicated caching modules have been included. Each data buffer has been implemented using dual-port block RAMs inside the FPGA, and this double-bus scheme allows updating the content of one buffer using data from/to the host DRAM while the other buffer is accessed to exchange data with the NAND Flash interface.

4.4. The Tracer Module

As indicated in Fig.5 and Fig.6, the RTMC-NFE architecture uses a Tracer module for command recording purposes. The Tracer module gets information from all NAND Flash channels for all commands applied to the NAND Flash emulator and creates data structures that are stored in a dedicated area in the host's DRAM. There are two types of commands: commands with data and commands without data. In both cases, a constant-length header is used. The header contains information about the type of command, the channel/CE used, its parameters and the respective timestamp. During initialization, a memory space is reserved in the host's DRAM for tracing purposes, and the Tracer fills that area sequentially, while local counters are updated and can be accessed by the Control Processor.

4.5. Data Structures and Memory Allocation

A critical aspect of designing a NAND Flash emulation system is the allocation strategy used for the host DRAM memory. A command at the NAND Flash interface has to be associated with a specific memory area in the server's DRAM, and this mapping is performed in the DRAM Access Module. Since the emulation system supports multiple channels, each DRAM Access Module accesses a specific memory region. In this work, we developed a dynamic memory allocation scheme, which is adapted to different NAND Flash memory configurations. For memory mapping, we use the following nomenclature: $CH_i$ is the channel number, $CE_i$ is the CE number, $L_i$ is the LUN number, $B_i$ is the block index and $Pl_i$ is used for the plane. All indices start at 0. These variables, together with the known features of the memory under emulation, are sufficient for calculating the proper host DRAM address. $BpP$ is the number of blocks per plane, $NoC$ is the number of channels, $CpC$ is the number of CEs per channel, $LpC$ is the number of LUNs per CE, $BlL$ is the block size, and $PpL$ is the number of planes per LUN.

We consider the memory allocation highlighted in Fig.7. This figure shows an example of how the memory is organized in a system that has two channels, each channel has four dies (CEs) and each die has two logical units (LUNs). The memory organization reflects the addressing used in the ONFI specs and results in a linear addressing space, for pages and for blocks. When multiple planes are used, the planes are interleaved, and the following equations provide the base addresses of any data structure within the device.

If $A_{HD}$ is the base address of the host DRAM allocated to the NAND Flash emulator, $LuL$ is the memory required per LUN, and $A_{LUN}$ is the base address of LUN $i$, then the following equations apply:

$$LuL = PpL \cdot BpP \cdot PpB \cdot L$$

$$A_{LUN} = A_{HD} + \left[ LpC \cdot (CpC \cdot CH_i + CE_i) + L_i \right] \cdot LuL$$

$$BlockAddress = A_{LUN} + (Pl_i + PpL \cdot B_i) \cdot BlL$$

$$HostDRAMAddress = BlockAddress + P_i \cdot L$$

where $P_i$ is the page index within the block. These equations have been implemented in each DRAM Access Module. During initialization, the Control Processor gets all system parameters of the emulated storage device from the emulator's application that runs on the host and programs the registers of all DRAM Access Modules. When a Read/Write/Erase command is received by a DRAM Access Module, the system parameters are used along with the parameters of this specific command to determine the exact location of the page/block in the host DRAM and to program the DMA accordingly. Using this approach, no host intervention is required during the execution of a command and low latency is achieved.

5. Experimental Results

In this section we present experimental performance results that demonstrate the behaviour of the developed RTMC-NFE, how fast pages are accessed in the host's DRAM and what the capabilities of its Tracer module are. As already mentioned, the RTMC-NFE has been implemented on an FPGA-based board attached to the host's motherboard using PCIe. The board supports PCIe Gen 3.0 with 8 lanes, has a DDR4 memory controller and a soft microprocessor. The host motherboard is equipped with an Intel Xeon processor that has 12 cores and a 2.20 GHz base frequency, and 128 GB of memory, which is split between the memory used by the OS (18 GB) and the memory used for the NAND Flash emulation process (112 GB). The host's OS is Ubuntu 16.04.5 LTS.

5.1. Experimental Setups

We tested and verified the proper functionality of the RTMC-NFE in two different set-ups. The first set-up was a commercial USB Drive with a USB-Flash controller and a NAND Flash chip with two dies, while the second set-up was an Open SSD that includes a customizable controller and a number of NAND Flash chips organized in two different channels with multiple memory banks per channel.

The USB-Flash controller in the USB Drive implements the back-end protocols for communicating with the host PC (USB, FTL) along with the front-end interfacing with the Flash memory. This specific controller supports USB 2.0 with a high-speed mode of 480 Mbps and it is compatible with ONFI/Toggle NAND Flash memory interfaces with SDR/DDR data transfers, for SLC and MLC memory technologies. The maximum supported user page size is 16 KB. In the USB Drive we emulated a NAND Flash memory chip that supports the SDR and Toggle DDR1.0 NAND Flash interfaces. The memory technology is MLC (Multi-Level Cell) and each die supports two independent LUNs. Every LUN has two planes, and each plane contains 1066 blocks, with 256 pages/block and 17 KB raw page size. The typical access time for the Page Read command is 40 us, while the Page Program time is 1.4 msecs and the Block Erase time is 5 msecs. The selected memory module offers cache functionalities for executing pipelined data transfers and is compatible with the Multi-Plane commands issued by the controller.

The Open SSD is an SSD development board that contains a controller and a number of NAND Flash chips. The SSD controller is based on an ARM processor and is compatible with the SATA 2.0 host interface (3 Gbps) with NCQ (Native Command Queuing) support. The Open SSD's controller can support up to 16 CEs and is compatible with various NAND Flash memory chips, supporting a few tens of GBs of storage memory.

The NAND Flash memory chips emulated in this set-up are MLC, with a single die per package, 8 KB user page size and 128 pages/block. The typical Page Read time is 250 usecs, while the Page Program and Block Erase times are respectively 1.3 and 1.5 msecs.

5.2. Power-up of a USB Drive

In this subsection we present how the RTMC-NFE has been used to emulate a NAND Flash chip in a USB Drive and what happens inside the device when it is powered up. Initially we removed the NAND Flash chip from the USB Drive and connected the RTMC-NFE to the pads of the memory chip. The RTMC-NFE's DRAM was initialized with all FFs (as is the case in a virgin NAND Flash memory) and then a commercially available tool was used for setting up the device, based on Low-Level Format functions. Using this procedure, the microcode of the USB controller was initialized along with all internal tables of the storage device. This information was stored in emulated pages in the RTMC-NFE's DRAM. Then the RTMC-NFE was deactivated and the DRAM's content was stored in a binary file. The size of this file is equal to the raw capacity of the USB Drive. Then the RTMC-NFE was restarted and its DRAM was initialized using the above-mentioned binary file.

When the USB Drive was connected to a USB port, its controller started a boot-load process from its local 'NAND Flash memory' (the RTMC-NFE in our case) and all commands were logged using the RTMC-NFE's Tracer. Then the USB device was recognized by the computer where it was attached, and it started behaving as a normal storage device, with all commands supported by the RTMC-NFE and logged by its internal Tracer. Fig. 8a shows the sequence of commands during power-up of the USB device, while Fig. 8b shows the indexes of the pages accessed during this procedure. The total power-up procedure lasts 70 msecs. The USB controller starts the NAND Flash interface in SDR mode, but after completing the basic boot-load procedure (at 26 msecs) it switches to DDR mode by applying the Set Features command, and the RTMC-NFE responds accordingly.

5.3. Latency on accessing DRAM for Page and Block access

The RTMC-NFE has been tested under various loading conditions. Initially, we verified that it stores and retrieves the NAND Flash content correctly by applying various loading scenarios. In all cases, the data were transferred and stored properly. For evaluating the RTMC-NFE performance, we measured the response time of all commands under different testing scenarios and we studied the effect of the DMA data width on the achieved latency. For presentation purposes, we selected two widths (16 and 128 bits) as indicated in Table 2. Fig.9a shows experimental measurements for 10,000 Page Read commands using 128-bit data widths, while Fig.9b shows the distribution of the collected statistics. Most latency values were measured in the area of 5 usecs, and the probability of exceeding 20 usecs is 2x10^-4. The NAND Flash emulator demonstrates robust behaviour, and some sporadic responses with higher values are due to the effect of OS operations on the server's DRAM memory. It is important to note that, despite these rare response spikes, the RTMC-NFE, with the proper data widths at its DMA engines, meets the timing requirements of any possible memory configuration. Fig.9c and Fig.9d show the cumulative distribution function (CDF) of the collected latency values for Page Read and Page Program commands under different loading conditions. In Fig.9c we study the effect of loading conditions at the host side.
In the 'no Load' scenario, we measure the latency of Page Read commands when only emulation functions are executed. Due to the time multiplexing achieved over PCIe, the latency is almost the same when both channels accept commands. In the 'Load' scenario, we used an application that stores a huge file to the server's DRAM while the Page Read latency measurements were collected. In this case, we observe a slight increase in the measured latencies, but the emulator still satisfies the NAND Flash timing specs. This slight latency increase is due to the fact that under heavy load conditions the data stored in the cache get invalidated more frequently, resulting in multiple cache misses, and that affects latency.

Fig.9d shows the CDF of experimental measurements for consecutive Page Read and Page Program commands for 128-bit data width. The emulator demonstrates robust behaviour for both commands. Page Program shows slightly higher latency due to additional transactions over PCIe and some internal operations at the DRAM Access Module, but this does not affect the emulator's behaviour, since the typical Page Program time is a few hundreds of usecs, much higher than the measured response time.

Although the Block Erase command has a more relaxed timing requirement, it is also associated with much larger data sizes. During a Block Erase command a large memory area has to be accessed, which is equal to the page size times the number of pages per block. This data size usually amounts to a few MBytes. Latency measurements regarding the Block Erase command are presented in Table 3. In these experiments, we used a page size of 18 KB and 128 pages/block, so the block size is 2.25 MBytes. Due to the maximum packet size of the used PCIe controller, the Block Erase command is executed using multiple PCIe transactions, and the maximum transfer rate is determined by the number of PCIe lanes used or by the maximum transfer rate supported by the DMA engine, whichever is smaller. In our set-up, PCIe supports up to 6 GBytes/sec, while the DMA supports up to 4 GBytes/sec. This multiple-packets approach allows efficient time multiplexing over PCIe when many DRAM Access Modules are used, and that guarantees proper response times as long as the total workload is less than the maximum transfer rate over PCIe.

5.4. Latency of the Tracer commands

In this subsection, we present experimental results that demonstrate the performance of the Tracer module. For that purpose, we collected and analysed 1M Tracer commands. Each Tracer command contains at least a command header of 32 bytes, while for the Page Program commands the respective data are included (18 KB in these measurements). The response times are 250 nsecs and 36 usecs respectively. The experimental results are shown in Fig.10a and their statistics in Fig.10b and Fig.10c. In both cases, the system demonstrates robust performance with minimum variability.

Concerning the Tracer module, we also have to analyse the capabilities of the whole tracing mechanism, in terms of number of supported commands, maximum duration of an experiment, sustained data rate and processing rate. We consider the case where 16 GBs have been allocated in the host DRAM for recording the tracer commands. The probability of tracing a Page Program command is dominant, since a Page Program command consumes the memory size needed for a few hundreds of other commands. For that reason, all sub-figures of Fig.11 highlight the performance of the Tracer module versus the probability of a Page Program command. According to Fig.11a, the Tracer module capability ranges from almost 1M commands up to 80M commands. That represents an experiment duration from 15 minutes up to more than 24 hours and a processing rate ranging from 100 commands/sec up to 7K commands/sec. The sustained data rate ranges from 2 MBps up to 120 MBps. In all cases we assumed that the NAND Flash interface supports DDR at 200 MBps, the Page Program time is 800 usecs, the Page Read time is 50 usecs and the Block Erase time is 1,500 usecs.
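The command-count figures quoted above follow from simple arithmetic on the trace-record sizes; the snippet below reproduces that side of the calculation for a 16 GB trace area under the stated assumptions (32-byte header, 18 KB of data per Page Program record). Only the command count is computed here; the per-command time model behind the duration and rate curves of Fig.11 is more detailed than this sketch.

```c
#include <stdio.h>

int main(void)
{
    const double trace_bytes  = 16e9;     /* DRAM reserved for tracing       */
    const double header_bytes = 32.0;     /* constant-length record header   */
    const double data_bytes   = 18432.0;  /* 18 KB payload of a Page Program */

    /* Number of traceable commands versus the probability p of a Page
     * Program command: average record size = header + p * data. */
    const double probs[] = { 1.0, 0.5, 0.1, 0.01 };
    for (unsigned i = 0; i < sizeof(probs) / sizeof(probs[0]); i++) {
        double avg_record = header_bytes + probs[i] * data_bytes;
        printf("P(Page Program) = %.2f -> %.2f M commands\n",
               probs[i], trace_bytes / avg_record / 1e6);
    }
    return 0;
}
```

For a Page Program probability of 1 this gives roughly 0.87M commands, and for a probability of 0.01 roughly 74M, consistent with the "almost 1M up to 80M" range reported above.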
6. Wear Leveling and Garbage Collection Mechanisms

In this section we present how the Tracer of the NAND Flash emulator can be used to analyse higher level functions of storage systems, like wear-leveling and garbage collection. We consider the case of the USB Drive mentioned in the previous section. The USB Drive is a commercially available device and the wear-leveling and garbage collection functions are implemented in the USB device driver at the host side. We consider hosts using either the Ubuntu or the Windows 10 operating system. The analysis we present is independent of the capacity of the USB device. The usual capacity ranges from a few tens of GBs up to a few hundreds of GBs. For decreasing the processing burden during the analysis, we decreased the device's capacity to 2 GB by manually setting its parameters using a Low-Level Format tool.

The device uses pages of 18 KB (16 KB for user data and 2 KB for metadata), organizes its blocks in two planes and there are 256 pages per block. In NAND Flash memories the pages can be accessed either for reading their content or for being programmed (if they have already been erased). Reprogramming can be performed only after an erase cycle, which is performed at the block level. Due to this restriction, and in order to maximize the lifetime of a NAND Flash device, wear-leveling and garbage collection functions are implemented [28], [29], [30].

Wear leveling is a technique that spreads the workload applied to a storage device evenly over all of its basic storage units, pages in the case of NAND Flash [31]. There are various wear leveling methods, static and dynamic, for associating logical block addresses (LBAs) with the physical block addresses (PBAs) of a Flash memory. In NAND Flash storage systems wear leveling is strongly associated with the mechanism of garbage collection. This mechanism is responsible for freeing memory areas, called garbage, that are no longer in use. In our experimental set-up both techniques are implemented in the device driver at the host side [32]. This is a basic characteristic of conventional file systems, such as FAT and NTFS, which were originally designed for rewriting data structures to the same storage area, especially when directory and other file system information is updated on the storage device.

For the experimental results we present in Fig.12 and Fig.13, we used a workload of 5 files of 100 MB each, 100 files of 10 MB each and 500 files of 1 MB each, 2 GB in total. We performed two experiments on two different hosts, one that uses Windows 10 and the other that uses Ubuntu. The first experiment was performed after completing a Low-Level Format of the USB storage device. The results of this experiment are shown in Fig.12. For comparison purposes we present the results for both hosts under the same initial conditions. In the first three sub-figures of each column, we present the pages and blocks accessed while saving the files, while in the last two sub-figures of each column we present statistics for the page offsets accessed. It appears that Ubuntu uses the first 64 pages of its blocks more frequently and its wear leveling is not as uniform as in Windows 10. The same holds for Fig.13, where the experiment was performed after storing and deleting the aforementioned workload many times before applying the same workload to an empty, but already stressed, device.

Windows 10 and Ubuntu apply different sets of commands for performing the same high-level task. Accessing the content of a file can be achieved by using various NAND Flash commands. For example, the pages can be read using the mandatory Page Read command or optional commands like the Multi-Plane Page Read or the Cache Read command. So, depending on the commands that the OS uses, the NAND Flash is instructed to respond to different commands. Based on the information collected using the RTMC-NFE's Tracer, advanced information can be extracted, like the blocks used for storing metadata information, the pages used for updating the system tables, where the microcode is located, etc.
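A post-processing sketch of this kind of analysis is shown below: it scans a sequence of trace records and accumulates, per block, the number of Page Program operations and a histogram of the page offsets used, which is essentially the statistic plotted in the last two sub-figures of Fig.12 and Fig.13. The record layout (`trace_rec_t`) is purely illustrative; the actual binary format written by the Tracer is not specified here.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS   1066   /* blocks per plane of the emulated device */
#define PAGES_PER_BL 256

/* Illustrative trace record; the real Tracer format differs. */
typedef struct {
    uint8_t  cmd;          /* 0: Page Read, 1: Page Program, 2: Block Erase */
    uint16_t block;
    uint16_t page_offset;  /* page index within the block */
    uint64_t timestamp_ns;
} trace_rec_t;

static uint32_t programs_per_block[NUM_BLOCKS];
static uint32_t offset_histogram[PAGES_PER_BL];

static void account(const trace_rec_t *r)
{
    if (r->cmd == 1 && r->block < NUM_BLOCKS && r->page_offset < PAGES_PER_BL) {
        programs_per_block[r->block]++;     /* wear-leveling uniformity across blocks */
        offset_histogram[r->page_offset]++; /* which page offsets the driver prefers  */
    }
}

int main(void)
{
    /* A tiny synthetic trace standing in for the records read back from the
     * tracing area of the host DRAM. */
    const trace_rec_t trace[] = {
        { 1, 10, 0, 1000 }, { 1, 10, 1, 2000 }, { 1, 11, 0, 3000 }, { 0, 10, 0, 4000 },
    };

    for (unsigned i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        account(&trace[i]);

    printf("block 10: %u programs, block 11: %u programs, offset 0 used %u times\n",
           programs_per_block[10], programs_per_block[11], offset_histogram[0]);
    return 0;
}
```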
7. Conclusions

We presented the architecture, implementation details and experimental results of a flexible Real-time Multiple-Channels NAND Flash emulator that is based on PCIe FPGA boards. The overall system provides advanced capabilities regarding memory capacity and supported channels. The experimental results demonstrated the system's reliability and its robust performance. The presented NAND Flash emulator can also be used along with advanced software tools for analysing the behaviour of higher level functions related to data storage [11].

References

[1] R. Bez, E. Camerlenghi, A. Modelli, A. Visconti, Introduction to Flash memory, Proceedings of the IEEE 91 (4) (2003) 489–502.
[2] T. Li, Z. Lei, A novel multiple dies parallel NAND Flash Memory Controller for high-speed data storage, in: 2017 13th IEEE International Conference on Electronic Measurement Instruments (ICEMI), 2017, pp. 6–11.
[3] I. Oukid, L. Lersch, On the diversity of memory and storage technologies, Datenbank-Spektrum 18 (2) (2018) 121–127.
[4] C. M. Compagnoni, et al., Reviewing the evolution of the NAND Flash technology, Proceedings of the IEEE 105 (9) (2017) 1609–1633.
[5] R. G. Jung Yoon, A. Walls, 3D NAND Technology Scaling helps accelerate AI growth, Flash Memory Summit, Santa Clara, August 2018.
[6] G. W. Burr, et al., Recent progress in phase-change memory technology, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6 (2) (2016) 146–162.
[7] Open NAND Flash Interface Specification, revision 4.1, ONFI Workgroup, 2017.
[8] Embedded Multimedia Card - eMMC v5.1A - JESD84-B51A, JEDEC, 2019.
[9] H. Chen, UFS 3.0 Controller Design Considerations, JEDEC, 2017.
[10] M. Jung, W. Choi, S. Gao, E. H. Wilson III, D. Donofrio, J. Shalf, M. T. Kandemir, NANDFlashSim: High-Fidelity, Microarchitecture-Aware NAND Flash Memory Simulation, ACM Transactions on Storage 12 (2) (2016) 6:1–6:32.
[11] J. Yoo, Y. Won, J. Hwang, S. Kang, J. Choi, S. Yoon, J. Cha, VSSIM: Virtual machine based SSD simulator, in: 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), 2013, pp. 1–14.
[12] K. T. Malladi, M.-T. Chang, D. Niu, H. Zheng, FlashStorageSim: Performance Modeling for SSD Architectures, in: 2017 International Conference on Networking, Architecture, and Storage (NAS), IEEE, Shenzhen, China, 2017.
[13] J. Shin, J. Bae, A. Na, S. L. Min, Copycat: A High Precision Real Time NAND Simulator (2016). arXiv:1612.04277.
[14] J. Brewer, M. Gill, Nonvolatile Memory Technologies with Emphasis on Flash: A Comprehensive Guide to Understanding and Using Flash Memory Devices, Vol. 8, Wiley, 2011.
[15] K. Kanda, N. Shibata, T. Hisada, K. Isobe, M. Sato, Y. Shimizu, T. Shimizu, T. Sugimoto, T. Kobayashi, N. Kanagawa, Y. Kajitani, T. Ogawa, K. Iwasa, et al., A 19 nm 112.8 mm2 64 Gb Multi-Level Flash Memory with 400 Mbit/sec/pin 1.8 V Toggle Mode Interface, IEEE Journal of Solid-State Circuits 48 (1) (2013) 159–167.

[16] NAND Flash Interface Interoperability, Standard, JEDEC Solid State Technology Association (July 2014).
[17] A. Prodromakis, S. Korkotsides, T. Antonakopoulos, A Versatile Emulator for the Aging Effect of Nonvolatile Memories: The Case of NAND Flash, in: Proceedings - 2014 17th Euromicro Conference on Digital System Design, 2014, pp. 9–15.
[18] Intel Corporation, Intel Xeon Processor E5-1600/E5-2600/E5-4600 v2, Product Families Datasheet, Volume 1, 329187-003 (March 2014).
[19] B. Bevin, How Memory Is Accessed, Tech. rep. (June 2016). URL https://software.intel.com/en-us/articles/how-memory-is-accessed
[20] M. Hillenbrand, Physical Address Decoding in Intel Xeon v3/v4 CPUs: A Supplemental Datasheet, Tech. rep., Karlsruhe Institute of Technology (September 2017).
[21] Intel Corporation, Intel Xeon Processor 7500 Series, Datasheet, Volume 2, 323341-001 (2010).
[22] Intel Corporation, Intel Xeon Processor E5 and E7 v4, Product Families Uncore Performance Monitoring Reference Manual, 334291-001US (April 2016).
[23] Y. Zhang, N. Guan, W. Yi, Understanding the Dynamic Caches on Intel Processors: Methods and Applications.
[24] E. Bougioukou, A. Ntalla, A. Pali, M. Varsamou, T. Antonakopoulos, Prototyping and Performance Evaluation of a Dynamically Adaptable Block Device Driver for PCIe-based SSDs, in: The 25th IEEE International Symposium on Rapid System Prototyping, 2014, pp. 51–57.
[25] ARM, AMBA AXI and ACE Protocol Specification (2011).
[26] Xilinx, SmartConnect v1.0 LogiCORE IP Product Guide (December 2017).
[27] Xilinx, AXI Bridge for PCI Express Gen3 Subsystem v2.0 Product Guide (November 2015).
[28] R. Subramani, H. Swapnil, N. Thakur, B. Radhakrishnan, K. Puttaiah, Garbage Collection Algorithms for NAND Flash Memory Devices – An Overview, 2013, pp. 81–86.
[29] S. Mittal, J. Vetter, A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems, IEEE Transactions on Parallel and Distributed Systems 27.
[30] Micron Technology, Wear-Leveling Techniques in NAND Flash Devices (TN-29-42) (2008).
[31] Y. Jin, B. Lee, A comprehensive survey of issues in solid state drives, in: Advances in Computers, Elsevier, 2019, pp. 1–69.
[32] J. Axelson, USB Mass Storage: Designing and Programming Devices and Embedded Hosts, Lakeview Research LLC, 2006.


Biography

N. Toulgaridis ([email protected]) received the Diploma in Electrical and Computer Engineering and the MSc degree in Hardware and Software Integrated Systems, both from the University of Patras. He is now a Ph.D. candidate at the same university. His primary research interests lie in the areas of high-performance computing, embedded systems, and storage.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Professor Theodore A. Antonakopoulos
University of Patras
Department of Electrical and Computer Engineering
26504 Rio - Patras, Greece
Tel: +30 (2610) 996 487, e-mail: [email protected]
Web: http://www.loe.ee.upatras.gr/English/People/Antonakopoulos.htm


Figure 1: The NAND Flash IC internal structure.

Figure 2: Accessing DRAM in Xeon servers from a PCIe card [18].

Figure 3: The RTMC-NVM Emulator Architecture

Figure 4: The extended RTMC-NVM Emulator Architecture

Figure 5: The NAND Flash Emulator Architecture

Figure 6: The main RTMC-NFE architecture: multiple NAND Flash Channel interfaces, multiple DRAM Access modules, a single Tracer module and a control processor.

Figure 7: Memory organization for two NAND Flash channels with two CEs per channel and two LUNs per CE and two planes per LUN. The blocks of the two planes are interleaved.

Figure 8: The commands executed in a USB Drive during power-up.

Figure 9: Page access latency: (a) Page Read experimental measurements, (b) Distribution of Page Read latencies, (c) Page Read CDFs for different loading conditions and (d) Page Read and Page Program latency CDFs.

Figure 10: Tracer commands (a) Experimental measurements, (b) Distribution of tracer commands without data, (c) Distribution of tracer commands with a whole page data.

Figure 11: Performance of the emulator’s tracer versus the probability of Page Program commands (tracer DRAM: 16GB): (a) Number of supported commands, (b) Processing Rate [cmds/sec], (c) Emulation continuous time and (d) Data Rate [MBps].

Figure 12: The statistics of Block Erase, Page Program and Page Read commands when the USB storage device is filled up with a number of files of different sizes after Low-Level Format (Windows 10 at the left column, and Ubuntu at the right).

Figure 13: The statistics of Block Erase, Page Program and Page Read commands when an empty USB storage device is filled up with the same files as in Fig.12. The device has been filled up and cleaned multiple times before applying this workload (Windows 10 at the left column, and Ubuntu at the right).


Table 1: NAND Flash Interfaces

Type             Transfer Rate [MTps]   I/O Voltage [Volts]
SDR              50                     3.3
Toggle 1         133                    3.3
Toggle 2         400/533                3.3/1.8
ONFi 2 - DDR     200                    3.3
ONFi 3 - DDR2    400/533                3.3/1.8
ONFi 4 - DDR3    667/800                1.8/1.2

Table 2: NAND Flash devices: Characteristics and data widths.

NAND Flash   T_R (us)   T_W (us)   T_BE (ms)   L (B)    PpB (P/Bl)   W_R/W (B)   W_BE (B)
M#1          25         230        0.7         4,320    128          1           4
M#2          40         1,300      3.0         8,640    128          2           2
M#3          50         1,400      5.0         17,664   256          2           4
M#4          75         1,300      15          18,592   1,024        2           8

Table 3: Block Erase latency

CDF               50%   90%   99.9%
Latency [usecs]   557   564   569