A novel memory management method for multi-core processors


ARTICLE IN PRESS

JID: CAEE

[m3Gsc;December 14, 2015;9:11]

Computers and Electrical Engineering 000 (2015) 1–11

Contents lists available at ScienceDirect

Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

A novel memory management method for multi-core processors ✩

Jih-Fu Tu ∗

Department of Electronic Engineering, St. John’s University, 499, Sec. 4, Tamking Road, Tamsui District, New Taipei City, Taiwan

Article info

Article history: Received 31 May 2015; Revised 21 October 2015; Accepted 26 October 2015; Available online xxx

Keywords: Multicore; System-on-chip; Hardwired

Abstract

This study examines a multicore processor based on a system-on-chip (SoC) and configured by a Tensilica Xtensa® LX2. The multicore processor is a heterogeneous, configurable dual-core processor. In this study, one core was used as the host to control the processor chip, and the other was used as a slave to extend digital signal processing applications. Each core not only owned its local memory, but also shared common data memory. In addition, the proposed multicore processors had a virtual memory. This additional memory supported the processor by enabling it to easily manage complex programs; it also allowed the two cores to access data from the unified data memory of different tasks. For bus management, a bus arbitration mechanism was added to handle the cores and to distribute the priority of asynchronous access requests. The benefits of the proposed structure include avoiding hardwired memory and reducing interface handshaking. To verify the proposed processor, it was simulated on the model level using a Petri net graph, and on the system level using ARM SoC designer tools. In the performance simulation, we found that the lowest latency-to-cost ratios were achieved using a 32-bit bus interface and a 4-entry data queue. © 2015 Elsevier Ltd. All rights reserved.

✩ Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. T-H Meen.
∗ Tel.: +886 930038679. E-mail address: [email protected]

http://dx.doi.org/10.1016/j.compeleceng.2015.10.009
0045-7906/© 2015 Elsevier Ltd. All rights reserved.

Please cite this article as: J.-F. Tu, A novel memory management method for multi-core processors, Computers and Electrical Engineering (2015), http://dx.doi.org/10.1016/j.compeleceng.2015.10.009

1. Introduction

In embedded systems, multifunction handheld devices often encounter issues related to power consumption, speed, and heat. The performance of these devices depends mainly on the processor clock frequency; however, raising the frequency requires a higher current to drive the hardwired logic. Because their silicon area is shrinking and they must conserve power, portable devices are often unable to reach higher frequencies and higher performance. Contrary to expectations, more circuits have been added, which increases power consumption. Meeting new requirements therefore calls for new solutions to these problems.

Many multimedia handheld devices now contain multicore processors and execute system control and multimedia computing on separate cores. In 2014, Kaseridis et al. [1] described a quasi-partitioning scheme for last-level caches that combined memory-level parallelism, cache friendliness, and the interference sensitivity of competing applications to manage the shared cache capacity efficiently. Although a heterogeneous core may reduce burdens by executing predefined tasks in advance, maintaining two sets of development environments often requires significant cost and labor. However, future processor systems will increasingly utilize parallel processing, and multithreaded dispatchers will manage performance. This allows the operating system (OS) to use the available processors to handle the OS, communications, and multimedia, and reduces the need to increase the CPU clock speed. In the past, it was difficult for multicore architectures to fully satisfy these application demands simultaneously.

We designed a multiprocessor model that supports embedded system applications. It requires that an instruction set architecture be integrated into a multiprocessor system-on-chip (MPSoC) based on configurable cores. This paper describes a configurable multicore structure. In this study, we implemented a dual-core system and a shared-memory bus arbitration method that handles shared-memory accesses; the basic structure is shown in Fig. 1. Other components (such as hardware accelerators or other cores) can be added to this simulator system for co-simulation, provided they contain a shared-memory bus interface.

Fig. 1. Basic structure of the multicore design.

We exploited the parallelism in random-access schemes to implement a multiple-core processor (shown in Fig. 1) that contains multiple function units. These include the microprocessors, the switch engine that arbitrates the use of the assigned bus, and the shared-memory module, which is globally addressable and can be distributed among the microprocessors or centralized in one place.

We used a Tensilica Xtensa® LX2 [11] to design the multicore processors. When collocated with Diamond standard processors, which can provide superior processing capability compared to traditional designs, the LX2 benefits in terms of area and power characteristics. This implementation can be configured in a multiprocessor architecture and verified by a provided simulation program. The main work of systems integration includes the multicores, local memory, cache, shared memory, and shared bus.

The remainder of this paper is organized as follows. Section 2 surveys previous related studies. The design methodology is introduced in Section 3. Section 4 presents the results of the proposed processor. Finally, we present the conclusions in Section 5.

2. Background

There is increasing demand for embedded multimedia communication systems in mobile and portable device applications.
For multimedia communications, the implementation of audio and video compression standards is essential. In addition, a system demanding improved performance requires a higher clock frequency. As such, multifunction handheld devices are often challenged by problems related to power consumption, clock speed, and heat dissipation.

2.1. Related works

In many multicore in-order processing systems, only one core can be used when the instruction at the head of the queue produces data input for the next instruction in the queue [2]. To achieve higher performance and flexibility, hybrid architectures have been proposed. In such an architecture, operation-intensive functions are implemented with hardwired blocks, and other less complex functions are implemented in software executed by an application-specific instruction processor. Current multimedia handheld devices often use built-in multicore approaches [3], in which system control and multimedia computations are executed on separate cores. Although heterogeneous cores reduce burdens from predefined tasks [4], maintaining two sets of development environments requires significant cost and labor.

For example, Bernabe et al. [5] used the compute unified device architecture (CUDA) and the CUDA basic linear algebra subroutines library, tested on two different graphics processing unit architectures, to process a hyperspectral data cube. Shnaiderman and Shmueli [6] presented the parallel path stack (PPS) and parallel twig stack (PTS) algorithms, which efficiently match extensible markup language (XML) query twig patterns on a parallel, multithreaded computing platform. Chang et al. [7] explored the joint considerations of memory management and real-time task scheduling in an island-based multicore architecture, and found that the access time provided by the local memory module of an island was shorter than that of the global memory module. Parallel discrete event simulation (PDES) can substantially improve the performance and capacity of simulations, allowing larger, more detailed models to be studied in less time. PDES is a fine-grained parallel application whose performance and scalability are limited by communication latencies [8]. In [9], the authors described a hardware mechanism designed to improve synchronization in a multicore architecture using Petri net formalisms; a module that interfaced with two MicroBlaze processors was developed. For memory and cache management, Garcia-Guirado et al. introduced in-cache coherence information, with respect to traditional bit-vector directories, for use in multicore processors [10].

2.2. Xtensa LX2 processor

The Tensilica Xtensa LX2 [11] is a reduced instruction set computer (RISC) microprocessor offered in two processor series: Xtensa, a configurable and extensible microprocessor, and Diamond, a standard fixed-core microprocessor. The configurable Xtensa processor architecture is designed specifically for system-on-chip (SoC) usage. The Diamond is generally compatible with the Xtensa core, but cannot be further configured because it is hardwired. Xtensa LX2 [11] cores are designed for SoC applications; their hardwired logic can easily be configured and extended with customized instructions. New processor cores can be synthesized, unlike traditional embedded processors or digital signal processing (DSP) cores, which cannot be configured. For this reason, we chose the Xtensa LX2 processor for this study.

2.2.1. Xtensa configurable hardwired logic

In addition to the above, some attached DSP configurations include the Vectra DSP and audio DSP engines. Other hardwired logic designs are available and configurable:

1. Configurable multiplier–accumulator and MUL.
2. Floating-point units, registers, state, and interface.
3. Data path configuration.
4. DSP instructions.
5. Tensilica instruction extension (TIE) 16- and 24-bit instructions.
6. One or two load store units.
7. Five or seven pipeline stages.
8. Local memory and cache size configurations.
9. Timers, interrupt vectors, and exceptions.

2.2.2. Audio DSP engine

The Tensilica Xtensa HiFi II audio engine has been precompiled into software packages suitable for digital audio formats. Its performance is close to that of traditional, fixed hardwired logic, but it retains the flexibility of code upgrades. The Xtensa HiFi II audio engine is based on the benchmark Tensilica Xtensa LX2 processor. This audio engine can be customized to increase code density and reduce processing cycles; it can also provide flexible data path widths (to increase efficiency for high-end audio), very long instruction word (VLIW) issue of multiple instructions per cycle, parallel data computing (SIMD instructions), low power consumption, and a minimized gate count (approximately 78 k gates). Because this DSP engine is not hardwired, it can be transferred to any 24-bit audio SoC application that needs it, and therefore fits well with extended applications.

3. Design method

Our design is based on the Xtensa 32-bit RISC and DSP architecture, owing to its 16- or 24-bit instruction set architecture, relatively low power consumption, high code density, and small silicon area. The architecture diagram of the proposed multicore design is shown in Fig. 2. The specification details are described in the following subsections.

3.1. Structure

1. Processor model
   a. Configurable processor: Xtensa LX2.
   b. Standard processor: DC330HiFi.
2. Memory model
   a. System memory: 4 GB.
   b. Local memory: 256 KB.
   c. Cache: 16 KB, 2-way, 16 entries.
   d. Memory management: region protection and memory management unit (MMU).
3. Bus model
   a. Bus interface: 64-bit.
   b. Arbitration mechanisms: centralized parallel arbitration.


Fig. 2. Main structure diagram.

Fig. 3. Multicore architecture diagram.

3.2. Multiprocessor descriptions

We used heterogeneous cores in the main structure to optimize system performance. We considered each processor’s performance and its system applicability, including whether the standard processors were simple and provided sufficient capability. Adding a configurable processor lets the design adapt to changing requirements. The multicore architecture is shown in Fig. 3. In the MPSoC architecture, a common memory space is shared, and the processors, each with its own processor interface, use an arbitration mechanism to gain access to a common bus. This processor includes one RISC core and one DSP core for voice encoding and decoding, because heterogeneous cores provide higher energy efficiency and make software migration easier.

3.2.1. Processor model

1. Configurable processor: Xtensa LX2.
2. Standard processor: DC330HiFi.

3.3. Memory design

The proposed system uses a common memory area (shared memory), as shown in Fig. 4. Once the memory is configured, each memory-size configuration executes the same procedures and produces the same results, but at a different speed. The width and size of memory units can be configured to meet the accuracy and bandwidth requirements of the data. However, if a very large memory configuration is built, its cost can exceed that of the core itself. Cache memory speeds are configured to improve the performance of local memory. When memory or cache is configured, the effect on the processors is considered. In terms of performance, the configured memory size may cause tradeoffs in area, power consumption, and speed. Cache misses consume more bandwidth; thus, we adjusted the cache and local memory configurations to optimize performance (Fig. 5).

Fig. 4. Memory structure.

Fig. 5. Cache and local memory.

Memory management is performed by the MMU configuration. If the memory configuration adds a virtual memory unit, the processor can handle programs of greater complexity. If it does not have virtual memory, application programs will be located in fast memory. We used an MMU to verify the effect on system resources.

3.3.1. Local memory model

The local memory is organized as follows:

1. Inst-RAM [0,2] size: 128 KB.
2. Inst-ROM [0,1] size: 256 KB.
3. Data-RAM [0,2] size: 128 KB.
4. Data-ROM [0,1] size: 256 KB.
5. RAM access width: 64 bits.
6. RAM access latency: 1 cycle.

3.3.2. Cache model

The cache is defined as follows:

1. Instruction-cache size: 16 KB.
2. Data cache size: 16 KB.
3. Cache write policy: alternating between write-back and write-through for the data cache.
4. Cache replacement algorithm: least recently filled policy.
5. Write buffer: 16 entries.
6. Associativity: 2-way set-associative caches.
7. Line size: cache locking per line, line size of 64 bytes.
8. Cache access width: 64 bits.
9. Cache memory access latency: 1 cycle.
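As a worked example of what these parameters imply, the sketch below derives the number of sets and the address-bit split for one 16 KB, 2-way cache with 64-byte lines. It assumes 32-bit addresses (consistent with the 4 GB system memory) and power-of-two sizes; the function name `cache_geometry` and its return fields are illustrative and are not part of any Tensilica tool.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Derive the set count and address-bit split of a set-associative cache.

    Assumes size_bytes, ways, and line_bytes are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


# Parameters taken from the cache model list: 16 KB, 2-way, 64-byte lines.
geometry = cache_geometry(16 * 1024, ways=2, line_bytes=64)
print(geometry)
```

With these values each cache has 128 sets, so an address splits into a 6-bit line offset, a 7-bit set index, and a 19-bit tag.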

3.3.3. Memory mapping

The specific purpose functions of this proposed multicore processor are allocated to the following memory addresses:

No.  Function                 Address
1.   Data RAM0                0x3FFE 0000
2.   Data RAM1                0x3FFC 0000
3.   RAM0                     0x4000 0000
4.   System ROM               0x5000 0000
5.   Reset vector             0x5000 0000
6.   System RAM               0x6000 0000
7.   Double exception         0x6000 03C0
8.   Window vectors           0x6000 0000
9.   L5 interrupt             0x6000 0240
10.  L4 interrupt             0x6000 0200
11.  L3 interrupt             0x6000 01C0
12.  L2 interrupt             0x6000 0180
13.  L1 (user interrupt)      0x6000 0340
14.  Kernel exception         0x6000 0300
15.  Debug exception          0x6000 0280
16.  Nonmaskable interrupt    0x6000 02C0
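The address assignments can be cross-checked mechanically. The following sketch encodes the memory map as a dictionary and lists the exception and interrupt vectors by their offset from the System RAM base. It also checks that a 128 KB Data RAM1 ends exactly where Data RAM0 begins, under the assumption that the 128 KB data-RAM size of Section 3.3.1 applies to each data RAM. The names `MEMORY_MAP` and `vector_offsets` are illustrative only.

```python
MEMORY_MAP = {
    "Data RAM0": 0x3FFE0000,
    "Data RAM1": 0x3FFC0000,
    "RAM0": 0x40000000,
    "System ROM": 0x50000000,
    "Reset vector": 0x50000000,
    "System RAM": 0x60000000,
    "Double exception": 0x600003C0,
    "Window vectors": 0x60000000,
    "L5 interrupt": 0x60000240,
    "L4 interrupt": 0x60000200,
    "L3 interrupt": 0x600001C0,
    "L2 interrupt": 0x60000180,
    "L1 (user interrupt)": 0x60000340,
    "Kernel exception": 0x60000300,
    "Debug exception": 0x60000280,
    "Nonmaskable interrupt": 0x600002C0,
}


def vector_offsets(memory_map, base_name="System RAM"):
    """Return (offset, name) pairs for entries at or above the base, sorted by offset."""
    base = memory_map[base_name]
    return sorted(
        (addr - base, name)
        for name, addr in memory_map.items()
        if name != base_name and addr >= base
    )


for offset, name in vector_offsets(MEMORY_MAP):
    print(f"0x{offset:03X}  {name}")

# Assumption: each data RAM is 128 KB, so Data RAM1 abuts Data RAM0.
data_ram_size = 128 * 1024
print(MEMORY_MAP["Data RAM1"] + data_ram_size == MEMORY_MAP["Data RAM0"])
```

Sorting by offset makes the layout easier to read than the table order: the window vectors sit at the base of System RAM and the double-exception vector is the highest entry, at offset 0x3C0.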

3.3.4. Memory management model

(a) Region protection
• Memory region: the 4 GB space is divided into 8 equally sized regions of 512 MB.
• Memory region access mode: selected through the access mode setting: bypass, allocate, no allocate, write-back, and write-through.

(b) Memory management unit
• Memory management terms include the page table entry, isolate, identity map, static, wired, auto-refill, ring, and address space identifiers.
• The translation lookaside buffer (TLB) is split into an instruction TLB and a data TLB; both are 1-way with eight entries.
• MMU for OS: Linux provides demand paging and memory protection.

3.4. Bus

The processor bus is an essential interface whose channel width and access speed are independently configurable. We designed this bus with a master–slave architecture. If a master device needs to communicate with another device, it must send a request signal to the arbitration mechanism; when permission is obtained, the master device gains the right to access the bus. By contrast, a slave device can never send a request signal; it can only wait for requests. The bus arbitration mechanism is a central parallel arbitration scheme; when two or more masters send requests simultaneously, the arbitration mechanism sends a permission response in accordance with predefined priority codes. In addition to considering communication methods, we considered using a common bus (shared bus) that can serve many devices by means of arbitration. The bus interface requires a suitable bandwidth to meet high-performance hardwired needs. The bus architecture is shown in Fig. 6.

3.4.1. Bus model

(a) Bus interface
• Width: 128 bits.
• Interrupts: 32 interrupts.
• Slave device interface.
• Master device interface.

(b) Arbitration mechanism
• System-interrupt mechanism.
• Error handling mechanism.
• Synchronization mechanism.


Fig. 6. Bus architecture.
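To illustrate the centralized parallel arbitration described in Section 3.4, the following minimal sketch grants the bus to the requesting master with the best predefined priority code. The class name `PriorityArbiter`, the master names, and the convention that a smaller code means higher priority are assumptions made for illustration; the paper does not specify the priority encoding.

```python
class PriorityArbiter:
    """Centralized arbiter: all pending requests are examined together and
    the master with the lowest priority code is granted the bus."""

    def __init__(self, priority_codes):
        # priority_codes maps a master name to its predefined priority code.
        self.priority_codes = dict(priority_codes)

    def grant(self, requests):
        """Return the master granted the bus, or None if no known master requests it."""
        known = [master for master in requests if master in self.priority_codes]
        if not known:
            return None
        return min(known, key=lambda master: self.priority_codes[master])


# Hypothetical dual-core setup: the host core outranks the DSP core.
arbiter = PriorityArbiter({"host_core": 0, "dsp_core": 1})
print(arbiter.grant({"dsp_core", "host_core"}))
print(arbiter.grant({"dsp_core"}))
```

Slave devices never appear in `requests`, which mirrors the rule that only masters may ask for the bus.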

3.5. Arbitration mechanism

Our experiment synchronizes one SimpleScalar module [14] and one Xtensa instruction set simulator (ISS) running in a cycle-by-cycle round-robin approach. The safest approach to dual-core synchronization is round-robin, which lets the processors run alternately, cycle by cycle. A method is needed for inter-core synchronization that achieves favorable communication performance while guaranteeing accuracy. In this study, we designed a communication-based synchronization approach in which the two cores are synchronized only when communication is necessary. We integrated this synchronization mechanism into the arbitration mechanism. When a shared-memory access instruction from a processor is decoded, the shared-memory access request and the simulation cycle count are sent to the bus arbiter. The arbiter compares the current cycle numbers of all processing components in the system, and grants the component with the lowest cycle number access to the shared memory.

3.6. Communication mechanism

We use the shared memory as the inter-processor communication medium to facilitate the synchronization scheme in the proposed framework. Any processing component that wants to access the shared memory should implement special load and store instructions. We assumed that all processing components were connected to the shared memory by a shared-memory bus. A dedicated bus ensures fast access and communication, while bus arbitration is provided to avoid conflicts. The shared-memory module was written in SystemC. To facilitate parallel computing tests, we provide a library of shared-memory communication services. The services cover shared-memory allocation, deletion, and mailboxes. Other services, such as semaphores and message queues, will be added in the near future. We also used a bus interface to manage request messages and the desired data transmissions among cores. Fig. 7 illustrates the communication topology for the multicore architecture. Data is transferred between cores by the send() routine, which sends data from the desired processor to the data buffer of the bus interface, and the receive() subroutine, which transmits the data from the buffer to the requesting processor.

According to a top-down design methodology, we devised the specification at the top level and refined it until we reached the implementation level. The main objective of the first design stage was to produce labeled Petri nets in which transitions were labeled only when there were actions in the corresponding modules. During the second stage, we transformed this high-level labeled Petri net [12, 13] into one that contained explicit transactions for control elements, and could therefore be translated into a circuit. We started with the initial transition of the Petri net shown in Fig. 8. This followed the most abstract specification of the pipeline processor’s operation: it alternated between user- and supervisor-level context-switch modes through the interrupt handler. Thus, the initial specification was simply a labeled Petri net with transitions representing those modes. Using ARM SoC designer tools, we created a simulation model at the system level, as shown in Fig. 9.

4. Results

Xtensa Xplorer is a software development toolkit with an integrated graphical user interface. It is used for processor creation, simulation, profiling, debugging, and analysis of program code. Tensilica’s Xtensa C and C++ compiler provides advanced optimizations, such as profile-directed feedback compilation, process optimization, software pipeline analysis, static


Fig. 7. Transfer algorithm for the bus architecture.
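The send()/receive() transfer of Section 3.6 can be sketched as a bounded buffer inside the bus interface. This is an illustrative model only: the class name `BusInterface`, the default depth of four entries (the queue depth highlighted in the abstract), the first-in, first-out ordering, and the core names are assumptions, not the author's SystemC implementation.

```python
from collections import deque


class BusInterface:
    """Bounded data buffer between a sending and a requesting processor."""

    def __init__(self, depth=4):
        self.depth = depth
        self.buffer = deque()

    def send(self, source, data):
        """Place data from `source` in the buffer; return False if the buffer is full."""
        if len(self.buffer) >= self.depth:
            return False
        self.buffer.append((source, data))
        return True

    def receive(self):
        """Hand the oldest buffered item to the requesting processor, or None if empty."""
        return self.buffer.popleft() if self.buffer else None


# Hypothetical transfer from the DSP core through a two-entry buffer.
bus = BusInterface(depth=2)
bus.send("dsp_core", 0x1234)
bus.send("dsp_core", 0x5678)
print(bus.send("dsp_core", 0x9ABC))  # the third send is refused: buffer full
print(bus.receive())                 # the oldest item is delivered first
```

A refused send models a sender that must wait in the bus interface, which is the waiting time discussed in the performance analysis.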

Fig. 8. Petri net model for the bus architecture.
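The initial specification described in Section 3.6, a net that alternates between user and supervisor modes through the interrupt handler, can be written as a very small place/transition net. The sketch below is a generic Petri net with unit arc weights; the class `PetriNet` and the place and transition names are invented for illustration and are not taken from Fig. 8.

```python
class PetriNet:
    """Minimal place/transition net with unit arc weights."""

    def __init__(self, marking, transitions):
        self.marking = dict(marking)      # place -> token count
        self.transitions = transitions    # name -> (input places, output places)

    def enabled(self, name):
        """A transition is enabled when every input place holds a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(place, 0) > 0 for place in inputs)

    def fire(self, name):
        """Consume one token per input place and produce one per output place."""
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] = self.marking.get(place, 0) + 1


# Hypothetical two-mode net: one token moves between user and supervisor mode.
net = PetriNet(
    marking={"user_mode": 1, "supervisor_mode": 0},
    transitions={
        "interrupt": (["user_mode"], ["supervisor_mode"]),
        "return_from_interrupt": (["supervisor_mode"], ["user_mode"]),
    },
)
net.fire("interrupt")
print(net.marking)
```

Because exactly one token circulates, the two transitions can only fire alternately, which is the alternation between modes that the text describes.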


Fig. 9. System-level simulation model based on ARM SoC designer tools.
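The synchronization rule of Section 3.5, in which the arbiter grants shared memory to the component with the lowest simulation cycle count, can be sketched as follows. The function names, the component names, and the alphabetical tie-break are assumptions made for this example; the paper does not state how ties are resolved.

```python
def grant_shared_memory(pending):
    """Pick the next component to access shared memory.

    `pending` maps a component name to the simulation cycle count attached
    to its shared-memory request. Returns None when nothing is pending.
    """
    if not pending:
        return None
    # Lowest cycle count wins; ties are broken alphabetically (an assumption).
    return min(pending, key=lambda name: (pending[name], name))


def drain(pending):
    """Serve every pending request in arbitration order and return that order."""
    pending = dict(pending)
    order = []
    while pending:
        winner = grant_shared_memory(pending)
        order.append(winner)
        del pending[winner]
    return order


# Hypothetical cycle counts for the two simulators named in Section 3.5.
print(drain({"xtensa_iss": 120, "simplescalar": 95}))
```

Serving the component that is furthest behind in simulated time keeps shared-memory accesses in cycle order without forcing the simulators to run in lockstep.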

single-assignment optimization, and reduced code size. The Xplorer development environment produces graphical results by using the Xtensa ISS. It can accurately model information such as cache performance, execution cycles, branches, exceptions, pipeline states, and results, and present them in tables and graphics.

4.1. Modeling protocol

The MPSoC design increases the number of cores on one chip. The Tensilica Xtensa Modeling Protocol (XTMP) provides a database-type application programming interface (API) to the ISS. This allows developers to create complex hardwired systems built on an original model and obtain results from the simulation, with the processor completing the original concept of the model. XTMP is a set of software development tools that can create customized multithreaded simulations. It supports quick and accurate simulation of an SoC design that consists of one or more processor cores: it can simulate a multiprocessor subsystem or a complex structure around a single processor, instantiating multiple processors and linking all their custom peripheral devices. XTMP is used early in the design to debug, profile, and validate the integration of the SoC and the software architecture. Because the XTMP simulator executes at a higher level than hardware description language simulators, simulation time can be drastically reduced.

4.2. Energy verification methods

Tensilica Xenergy is an energy estimation tool dedicated to the initial stages of development, providing evaluations of power and energy consumption. With this tool, the correct Diamond or Xtensa processor configuration options can be selected. Another feature is the ability to co-develop with TIE instructions to optimize Xtensa energy estimations. Xtensa processors support TIE instructions to reduce the number of instructions needed for execution, reduce the overall energy consumption, and relax the clock frequency requirements. Xenergy allows developers to forecast how application code executes out of the processor’s local memory and cache, enabling them to quickly trade off size, performance, and energy. The processor’s local memory includes the instruction cache (including tag and data arrays), instruction RAM, instruction ROM, data cache (including tag and data arrays), data RAM, and data ROM.

4.3. Instruction simulation methods

We proposed an Xtensa configurable structure in which the core is configured to optimize the performance of embedded applications. The Xtensa ISS is a standard software development kit that allows developers to create new SoCs, perform debugging, and fine-tune the performance of applications; it can also serve as a hardware simulator, which may be used as a reference model for verification tools. The ISS provides cycle and performance information; by analyzing different processor configurations, it can help researchers choose the optimal configuration options when assessing complex instruction set simulation results.

4.4. Performance analysis

The performance analysis consisted of two phases: the latency of the bus interface and the latency of the cache, where UT is the unit time of a simulation platform equipped with an Intel i7 CPU 870 @ 2.93 GHz. An efficiency appraisal procedure using the MediaBench suite [15, 16] was used. The appraisal procedures included image, sound, and communication compiled-code procedures.

First, we measured the communication latency for different bus interface widths: 8-bit, 16-bit, 32-bit, 64-bit, and 128-bit. Referring to Fig. 10, we found that the lowest latency-to-cost ratio, defined as the bus interface size divided by the reduced time, was obtained for each core at 32 bits, on average. The reason is that a request is issued from the requesting processor to the bus interface, which then issues an interrupt signal to the requested processor. Thus, every request must spend waiting time in the bus interface. When the interface was wider than 32 bits, such as 64 or 128 bits, the latency did not improve noticeably for the cost.

Fig. 10. Latency of different bus interface sizes.

Second, we analyzed the communication latency for caches of different sizes. Referring to Fig. 11, we found the lowest latency-to-cost ratio in the 4-entry case. This is because the data needed by the requesting processor is transmitted from the requested processor to the cache, delivered to the bus interface, and then immediately transmitted to the requesting processor. Thus, the needed data does not spend any waiting time in the cache. The maximum number of entries is referred to as the designed cache.

Fig. 11. Latency of different cache sizes.

5. Conclusions

To support SoC designs with increasing numbers of processors, Tensilica XTMP is used to provide a database-type API to the ISS; this allows developers to configure complex hardware systems, obtain the original model and simulation results, and accomplish the original processor model concept. The ISS accumulates a statistical summary of the micro-architecture units by using a simple instruction set architecture. The displayed information includes the memory model and the instruction set architecture summary. The memory model report includes relevant information on the local and cache memory. To simplify, the data memory structure must presume a memory configuration and simulate its impact on speed. Through this design flow, costs can also be reduced. The flow provides a relative energy analysis between processor configurations, so that different configurations can be chosen or the application code tuned to reduce processor energy consumption. Through this analysis, the estimator adjusts area, speed, and power, and creates an optimization reference for the multiprocessors in the initial stage.


References

[1] Kaseridis D, Iqbal MF, John LK. Cache friendliness-aware management of shared last-level caches for high performance multi-core systems. IEEE Trans Comput 2014;63(4):874–87.
[2] Claeys D, Bruneel H, Steyaert B, Mélange W, Walraevens J. Influence of data clustering on in-order multi-core processing systems. Electron Lett 2013;49(1):28–9.
[3] Hong JH, Ahn YH, Kim BJ, Chung KS. Design of OpenCL framework for embedded multi-core processors. IEEE Trans Consumer Electron 2014;60(2):233–41.
[4] Tu J, Saitoh K, Koshiba M, Takenaga K, Matsuo S. Optimized design method for bend-insensitive heterogeneous trench-assisted multi-core fiber with ultralow crosstalk and high core density. J Lightwave Technol 2013;31(15):2590–8.
[5] Bernabe S, Sanchez S, Plaza A, Lopez S, Benediktsson A, Sarmiento R. Hyperspectral unmixing on GPUs and multi-core processors: a comparison. IEEE J Selected Topics Appl Earth Obs Remote Sensing 2013;6(3):1386–98.
[6] Shnaiderman L, Shmueli O. Multi-core processing of XML twig patterns. IEEE Trans Knowl Data Eng 2013;6(6):2445–52.
[7] Chang CW, Chen JJ, Kuo TW, Falk H. Real-time task scheduling on island-based multi-core platforms. IEEE Trans Parallel Distrib Syst 2015;26(2):538–50.
[8] Wang J, Jagtap D, Abu-Ghazaleh N, Ponomarev D. Parallel discrete event simulation for multi-core systems: analysis and optimization. IEEE Trans Parallel Distrib Syst 2014;25(6):1574–84.
[9] Pereyra M, Gallia N, Alasia M, Micolini O. Heterogeneous multi-core system, synchronized by a Petri processor on FPGA. IEEE Latin America Trans 2013;11(1):218–23.
[10] Garcia-Guirado A, Fernandez-Pascual R, Garcia JM. In-cache coherence information (ICCI). IEEE Trans Comput 2015;64(4):995–1014.
[11] Tensilica Inc. Diamond standard processors data book. Santa Clara, USA; 2006. http://www.tensilica.com/products/xtensa_overview.htm.
[12] Murata T. Petri nets: properties, analysis and applications. Proc IEEE 1989;77(4):541–80.
[13] Petri CA. Communication with automata. New York: Griffiss Air Force Base; 1966. Tech. Rep. RADC-TR-65-377, vol. 1, suppl. 1.
[14] Burger D, Austin TM. The SimpleScalar tool set, version 2.0. University of Wisconsin-Madison Computer Sciences Department Technical Report #1342; June 1997.
[15] Bishop B, Kelliher TP, Irwin MJ. A detailed analysis of MediaBench. In: Proceedings of the IEEE Workshop on Signal Processing Systems, SiPS 99; September 20–22, 1999. p. 448–55.
[16] Lee C, Potkonjak M, Mangione-Smith WH. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. In: Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture; 1997. p. 330–5.

Jih-Fu Tu received his B.S. and M.S. degrees from National Kaohsiung Normal University and National Taiwan Normal University. He received his Ph.D. degree in Computer Engineering from Preston University in the United States. He is an Associate Professor in the Department of Electronic Engineering at St. John’s University. His interests include computer architectures, SoC, and discrete event systems.
