INTEGRATION, the VLSI journal 51 (2015) 72–80
Memory customisations for image processing applications targeting MPSoCs

David Watson, Ali Ahmadinia*

School of Engineering and Built Environment, Glasgow Caledonian University, Cowcaddens Road, Glasgow, UK
Article info

Article history: Received 26 May 2014; Received in revised form 10 June 2015; Accepted 11 June 2015; Available online 25 June 2015.

Keywords: MPSoCs; Memory architectures; Viola Jones; Image processing

Abstract

Multiprocessor System on Chips (MPSoCs) are quickly becoming the mainstay in embedded processing platforms due to their hardware and software design flexibility. This flexibility increases the design space for developers, introducing trade-offs between performance and resource/power consumption. This paper presents a comprehensive evaluation of memory customisations for MPSoCs. Custom arrangements of instruction and data cache are presented to optimise off-chip memory consumption and improve system performance. Off-chip memory management and threading are presented to balance the computational load on available processors and improve system performance. The proposed methods are applied to an object detection case study, where performance increases of up to 2.93× are achieved when compared to standard memory designs. Furthermore, the proposed techniques can increase the number of possible processors in an MPSoC by reducing the number of bus interconnects. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

Multicore processing has become an integral part of modern-day electronics and is used in products ranging from desktop CPUs to tablets and mobile phones. Incorporating multiple cores onto one chip increases the computational power of the processor and increases application concurrency, making multicore processing an attractive platform for parallel applications, such as image processing. General-purpose multicore processors, such as those found in desktop PCs, have proven to be effective for executing face detection algorithms due to their high degree of parallelism, large amounts of cache memory, and high clock rates. However, embedded systems are far more resource-constrained than desktop systems and often have small amounts of on-chip memory and lower clock rates. Therefore, the implementation of image processing algorithms on embedded systems requires more intricate optimisation of on- and off-chip memories. Multiprocessor System on Chips (MPSoCs) are growing in popularity within the embedded community, as they allow developers to customise at both the hardware and software levels. MPSoCs can be used for specific application domains to carry out dedicated tasks: that is, the application carries out the same computational tasks on all input data. Processor arrangement,
* Corresponding author. Tel.: +44 141 331 8242. E-mail address: [email protected] (A. Ahmadinia).
http://dx.doi.org/10.1016/j.vlsi.2015.06.004. 0167-9260/© 2015 Elsevier B.V. All rights reserved.
inter- and intra-processor communications, and memory architectures all characterise MPSoCs and contribute to their flexibility and popularity. In this work, MPSoCs are implemented on Field Programmable Gate Arrays (FPGAs) through the use of soft-core processors – processors that do not exist on the FPGA fabric unless explicitly instantiated. Like most embedded systems, MPSoCs are constrained by power and resource consumption limitations, as well as application performance requirements, making the design of the memory architecture important. Image processing algorithms process large volumes of data in an often complex manner: object detection algorithms are one such example. Object detection algorithms perform the classification of images into positive and negative regions [19], where positive regions contain objects of interest and negative regions do not. The classification process employed can impact system performance due to the unknown presence (and number) of objects in a given image. Therefore, the implementation of MPSoCs targeting object detection algorithms must complement the algorithm's data requirements and directly address the classification process employed. This paper presents memory customisations for image processing applications executing on MPSoCs. Data reuse techniques for instruction and data caches are proposed to improve system performance, reduce on-chip memory consumption, and increase the number of processors in a system. Off-chip memory management is presented to complement the data buffers of a system, such as those found in image processing applications, and their interaction
with processors. A custom multithreaded programming model is presented to complement the scan patterns of image processing algorithms and improve the balance of computational loads across processors. To the best of our knowledge, this is the first work to target these design points and memory customisations for image processing algorithms targeting MPSoCs. This paper is organised as follows. An overview of the related work is provided in Section 2. The MPSoC development environment for this investigation is described in Section 3, and the Viola Jones face detection algorithm is summarised in Section 4. The proposed on- and off-chip memory customisations are described in Sections 5 and 6 respectively. The experimentation methodology is described in Section 7, along with results. A summary of the key points and final remarks are given in Section 8.
2. Related work A review of literature encompassing multicore processing for the Viola Jones algorithm and MPSoC design is provided to illustrate the problem domain of MPSoC design and object detection algorithm implementation. Lai et al. [14,15] optimise the data locality of the Viola Jones algorithm for a multicore processor through loop-transformations to improve cache performance. Chiang et al. [4] simulate a concurrent implementation of Viola Jones with up to 64 ARM v5 cores, where the inherent parallelism of the algorithm is exploited with software-level optimisations. The Viola Jones algorithm is implemented by Chen et al. [2] who use task interleaving, reordering, and splitting to improve load balancing on the IBM Cell. Hefenbrock et al. [7] use GPUs to implement the Viola Jones algorithm, where SIMD/SIMT instructions are used to improve memory performance through collaborative memory accesses. Deepak Shekhar and Varaganti [6] implement a dual-core face detection engine and use data-partitioning to reduce false-sharing within caches. Chen et al. [3] implement a parallelised version of a body tracking algorithm for multicore processors, where data- and functional-level parallelism are used to parallelise the algorithm. Ranjan and Malik [17] present a multicore implementation of a face detection and tracking system that uses aspects of the Viola Jones algorithm. The algorithm is parallelised by allocating an equal portion of input data to each core. Ach et al. [1] implement a multicore road-sign detection algorithm that uses the same classification engine as the Viola Jones algorithm, with circular buffers to organise threads. The above work focuses on software-level optimisations to improve the performance of multithreaded image processing algorithms for multicore processors. 
These implementations are constrained by the general-purpose architecture of multicore processors, where shared caches (between threads and cores) degrade performance through contention and thrashing within the resource. To address these shortcomings, the proposed memory customisations target data reuse at the hardware level and memory design at the software level. Core load imbalances are addressed through custom memory management and multithreading. Cache performance analysis is carried out to optimise the cache layout and data reuse of the system. On the other hand, there exists a large catalogue of literature for multimedia and networking applications targeting MPSoCs; a brief overview is provided. Cho et al. [5] use compiler and software optimisations, as well as a hardware-based data access record, to exploit the data reusability of multimedia applications with data-intensive loops. Scratch Pad Memories (SPMs) are used for data reuse at runtime and the compiler is trained with representative data to optimise this data reuse. SPMs have been used extensively within the literature to provide data locality and reuse [10,12,13].
Wolf [21] provides a study of MPSoCs, where WCETs, hardware and software optimisation techniques, and memory partitioning are highlighted as key design considerations. Huerta et al. [8] identify that the arrangement of soft-core processors on an FPGA is similar to the symmetric multiprocessor (SMP) model, where the main disadvantage is the lack of an OS to schedule tasks. Wolf et al. [22] discuss the application-specific nature of MPSoCs, and how they require a dependent programming model. Based on this review, we make the following observations of image processing applications executing on MPSoCs:

1. Efficient and effective data reuse must be implemented.
2. Implementing image processing algorithms on MPSoCs requires a dependent programming model.
3. Memory design and interaction can directly impact system performance.
4. The number of possible processors in a system should be scalable to facilitate parallel processing.
This paper addresses these points through custom memory design and testing with a data-intensive face detection algorithm. Cache analysis is carried out to customise instruction and data caches of the system. A cost function is presented to optimise the utilisation of on-chip memory and increase the number of processors that can be implemented within an MPSoC. Memory management is presented to facilitate the interaction and unification of MPSoC processors with off-chip memory, as well as a dependent multithreaded programming model for image processing algorithms executing on MPSoCs. All customisations are robustly tested with large datasets and evaluated by their performance, power consumption, and resource consumption. Where possible, results are compared to the literature to highlight the benefits of the proposed memory customisations. First, an overview of the development environment is provided for the reader's benefit.
3. MPSoC development environment

The proposed MPSoC architectures in this paper are implemented and tested on a Virtex-6 XC6VLX240T-1FF1156 FPGA provided by the Xilinx ML605 development board [25]. MPSoC designs are created using soft-core processors through the accompanying development tools. The MPSoCs implemented resemble a multicore Symmetric Multiprocessor (SMP) machine, where all processors of the system are connected to memory modules via a common bus. The common bus used in this work is the Advanced eXtensible Interface 4 (AXI4) – a bus matrix topology providing high throughput and connectivity for up to 16 masters and slaves. Peripherals are accessed by processors via a light-weight version of AXI4 known as AXI-lite. Fig. 1 illustrates a sample MPSoC containing 2 processors equipped with level-one Instruction Cache (IC) and Data Cache (DC). The main memory module used is a 256 MB DDR3 module clocked at 200 MHz. Each processor has customisable computational logic, cache arrangements, and local memory. As a result of the development tools, each processor must be programmed and synchronised independently of every other processor in the system. Furthermore, the debug module available for the tools used in this work supports only eight processors – limiting the implemented MPSoCs to 8 processors. However, this is an artefact of the development tools and not of the proposed customisations. The Viola Jones face detection algorithm is described next.
4. Viola Jones face detection algorithm
The Viola Jones face detection algorithm [19] has been used for many years in both commercial and academic settings to accurately detect faces in images. The algorithm scans an image with a search window of 20 × 20 pixels looking for facial features. If enough facial features are found, the window contains a face; otherwise it does not. To account for faces of different sizes, the input image is resized to generate image scales. The features of the algorithm are arranged into a degenerate decision tree, known as a cascade. The cascade is arranged into stages, which are structured such that early stages contain less computation but reject a high percentage of search windows. To facilitate the fast computation of the features, look-up tables known as integral images are used, where each pixel of an integral image contains the sum of the pixels above and to the left of it inclusively, as shown by

ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)    (1)

where ii(x, y) and i(x, y) are the integral image and original image respectively. This representation allows any rectangular region of an integral image to be evaluated using only 4 pixels, as shown in Fig. 2. For example, the sum of the pixels in region 4 is evaluated using

Region Sum = A − B − C + D    (2)
Facial features are detected using 2D wavelets known as Haar-features, which are composed of 2–3 of the rectangular regions described in Fig. 2. Each rectangle is stored as a list of coordinates relative to the search window and a weight to normalise the region. These rectangles are computed according to

Rectangle Sum = (A − B − C + D) × W    (3)
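Equations (1)–(3) can be sketched in a few lines of Python. The function names are illustrative (not from the paper), and the A/B/C/D corner labelling follows Fig. 2 under the assumption that A is the bottom-right corner of the rectangle:

```python
def integral_image(img):
    """Build ii(x, y) = sum of i(x', y') for x' <= x, y' <= y (Eq. (1))."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def region_sum(ii, top, left, bottom, right):
    """Sum over an inclusive rectangle using 4 corner look-ups (Eq. (2))."""
    a = ii[bottom][right]                                      # A
    b = ii[top - 1][right] if top > 0 else 0                   # B
    c = ii[bottom][left - 1] if left > 0 else 0                # C
    d = ii[top - 1][left - 1] if top > 0 and left > 0 else 0   # D
    return a - b - c + d

def rectangle_sum(ii, top, left, bottom, right, weight):
    """Weighted rectangle of a Haar-feature (Eq. (3))."""
    return region_sum(ii, top, left, bottom, right) * weight
```

Each region sum costs four array reads regardless of rectangle size, which is what makes per-window feature evaluation cheap.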
A stage is stored as a list of Haar-features and a stage threshold. The structure of stage one of the cascade and its two-rectangle Haar-features is illustrated in Fig. 3. The OpenCV v2.0 cascade [16] used for this investigation contains twenty-two stages, with higher-ordered stages containing progressively more Haar-features.

Fig. 1. MPSoC containing two processors, depicting level-one instruction and data caches, shared main memory, and sample peripherals.
Fig. 2. Evaluating an image region using integral images.
Fig. 3. Conceptual structure of the Viola Jones detection library: first stage of the cascade (left) and the structure of a 2-rectangle Haar-feature (right).
Fig. 4. Cascade of Viola Jones, where candidate search windows enter at stage 1 and positive search windows exit at stage 22.

A search window enters the cascade at stage 1 and is
rejected if it fails at any stage. If a search window passes all twenty-two stages, it is deemed to contain a face, as shown by Fig. 4. This process is carried out for all search windows, contributing to approximately 90% of the algorithm's runtime. Large images, and images containing many faces, require more processing time, as they contain larger search spaces and use more stages of the cascade respectively.
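The early-rejection behaviour of the cascade can be sketched as follows. The stage representation here (a list of (features, threshold) pairs of scoring callables) is a deliberate simplification of the real OpenCV cascade structure:

```python
def detect_faces(windows, cascade):
    """Run each candidate window through the attentional cascade.

    `cascade` is a list of stages; each stage is a (features, threshold)
    pair, where each feature is a callable scoring the window.  This is a
    hypothetical structure standing in for the OpenCV Haar records.
    """
    faces = []
    for window in windows:
        passed = True
        for features, threshold in cascade:
            # Sum the Haar-feature responses for this stage.
            score = sum(f(window) for f in features)
            if score < threshold:
                passed = False   # early rejection: most windows exit here
                break
        if passed:
            faces.append(window)
    return faces
```

Because most windows fail an early, cheap stage, the average cost per window is far below the cost of evaluating all twenty-two stages.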
5. On-chip memory customisations

The proposed data reuse and cache customisations aim to optimise the use of on-chip memory resources for MPSoCs. The resulting architectures are comparable to Uniform Memory Access (UMA) architectures on account of the consistent access latencies of each resource. The customisations are first discussed for the general case and then described for the Viola Jones case study.

5.1. Instruction Cache

Generally, Instruction Cache (IC) is used to promote the reuse of instructions previously accessed from main memory, with the goal of improving processor performance. Image processing applications consist of computational kernels repeatedly applied to input data. Therefore, the program instructions that encode the functionality of these kernels will be the same for all processors applying the kernels to the input data. Consequently, if the IC of each processor has sufficient capacity, the instructions for the kernel may always reside in the IC after a certain amount of time has elapsed. This theory was tested through simulation by measuring the IC misses for one search window with the Viola Jones algorithm, which equates to the main kernel of the application. Fig. 5 shows the number of cache misses against time for varying cache sizes. As can be seen, the number of cache misses decreases to 0 over time for the detection process with cache sizes of 1–64 KB, demonstrating the repeatability of instructions within the application. Storing these instructions locally would allow the initial cache misses to be avoided. Furthermore, these cache miss patterns will repeat for each search window of the application. Therefore, it
would be more beneficial to statically store instructions within the cache and avoid unnecessary cache misses. The local memory of a Microblaze processor boasts the same access time as level-one cache and can be pre-loaded with application instructions, guaranteeing 1–2 cycles of access latency to application instructions. Access to local memory is carried out using the Local Memory Bus (LMB), as shown in Fig. 6. Using the LMB for instruction accesses provides two clear advantages. First, the number of transactions over the AXI4 bus is reduced, as accesses to instructions take place over the LMB. Secondly, the number of AXI4 master connections is reduced by a factor of 2, since the IC would otherwise have to be connected as a master to the bus. To better illustrate this, let m be the number of masters supported by an AXI4 bus. A standard processor using IC and DC requires 2 master slots on the AXI4 bus, whereas the proposed optimised processor requires only 1 (see Fig. 6). The number of processors supported by the standard program memory layout is therefore m/2, while the optimised layout supports m, i.e. twice as many processors per MPSoC.
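The master-slot arithmetic above can be made concrete with a short sketch (illustrative names; `16` reflects the 16-master AXI4 limit quoted in Section 3):

```python
def max_processors(bus_masters, local_instruction_memory):
    """Each processor needs 2 AXI4 master slots (IC + DC) in the standard
    layout, but only 1 (DC) when instructions are served over the LMB."""
    slots_per_processor = 1 if local_instruction_memory else 2
    return bus_masters // slots_per_processor
```

With a 16-master bus this gives 8 processors for the standard layout and 16 for the optimised one, matching the factor-of-2 argument in the text.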
Fig. 7. Number of data cache misses as a product of time for the Viola Jones algorithm with varying cache sizes and shared cacheable region.
For the Viola Jones case study, the code size is 14.56 KB and can therefore be stored in 16 KB of local memory. If an algorithm's code size is larger than the amount of on-chip memory available, it would not be logical to store the entire code section in local memory. For such cases, profiling can be used to identify the most commonly used kernels, whose instructions can then be stored locally, with less frequently used instructions stored off-chip.
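The profiling-guided placement just described can be sketched as a greedy selection. The kernel names, sizes, and call counts below are hypothetical profiling data, not figures from the paper:

```python
def place_kernels(kernels, local_capacity):
    """Greedy placement: hottest kernels (by profiled call count) go to
    local memory until the byte budget is spent; the rest stay off-chip.

    `kernels` maps name -> (code_size_bytes, call_count).
    """
    local, offchip, used = [], [], 0
    for name, (size, calls) in sorted(
            kernels.items(), key=lambda kv: kv[1][1], reverse=True):
        if used + size <= local_capacity:
            local.append(name)
            used += size
        else:
            offchip.append(name)
    return local, offchip
```

For Viola Jones no such split is needed, since the whole 14.56 KB code section fits the 16 KB local memory.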
Fig. 5. Number of instruction cache misses as a product of time for the Viola Jones algorithm with varying cache sizes.
Fig. 6. Standard program memory arrangement with I-cache (a) and optimised program memory (b).

5.2. Data Cache
Data Cache (DC) aims to reduce the number of main memory accesses by storing frequently used data close to the processor. However, for image processing applications, and in particular object detection, there are different types of data that may contend for the DC. The libraries of image processing algorithms and the input data are one such example of contention. The library (kernel and detection library) is applied to all input data and is frequently used. However, input data is also frequently used, and for detection algorithms may be used as often as the library. Therefore, there lies a tradeoff between reusing application data and input data. The focus of the data cache customisations in this article is to reduce data contention through tradeoff analysis of the resource used to access the data. Fig. 7 shows the results of DC analysis when the Viola Jones detection library does and does not contend with input data over the cacheable range. Note that cache misses were the same for all cache sizes tested (1–64 KB). As can be seen, sharing the DC between the library and input data leads to more cache misses, whereas using the DC solely for application data leads to a reduction in cache misses. To model this tradeoff, we present a cost function. A cost function is provided in [23] for copying data to different memory hierarchies for read-only data. Detection libraries of object detection applications, such as Viola Jones, can be classed as one read instruction in a nested loop [23], since the processing of the stages of the cascade is consecutive. Due to the dependency between early and latter stages of the detection library, reusing the library would require the processing of each stage for every search window, which could increase the complexity of the software and its control operations.
Instead, we present a cost function to determine the optimal level of data reuse for lookup libraries, which is a weighted function of the power required to access the memory and the memory's size. We assume the library is composed of arrays of data that are accessed in a
serial manner. The cost function for an array s of the library is described by

cost(s) = α · power(s) + β · area(s)    (4)

where:

1. α and β are weighting factors;
2. power(s) is the power consumed by the memory storing array s;
3. area(s) is the size of array s.

The cost function for each array with on- and off-chip storage is analysed through Pareto analysis. Fig. 8 presents Pareto analysis for the arrays of the Viola Jones detection library using the proposed cost function. Through analysis of the plot, we store consecutive points that minimise the cost function on-chip and all other points (arrays) off-chip. It can be seen that arrays 1–3 satisfy this principle and that array 5 also has a low cost function. However, it does not satisfy the consecutive access of the detection library and is therefore disregarded.

Fig. 8. Pareto analysis of the cost function for on- and off-chip memories for the arrays of the Viola Jones detection library.
Fig. 9. Conceptual illustration of (a) standard main memory management and (b) the proposed, where data allocations are shared between processors through global allocation addresses.
Fig. 10. Scan pattern of image processing applications, where an active region is moved horizontally and vertically across an image.

6. Off-chip memory customisations

The following subsections describe the proposed off-chip memory customisations.

6.1. Memory design and management
The main issue for MPSoCs when interacting with shared main memory can be regarded as a hardware and software problem. Processors of the MPSoC have full access to main memory over the shared bus, but are not aware of other processors' interactions with the module without the use of hardware or software intervention. All memory operations are therefore performed naively by processors. Fig. 9 illustrates how main memory may look with two processors interacting with data allocations and their respective ownership of these allocations. For the case where allocations are made during runtime (a), processors are only aware of their own allocations to memory, which can lead to processing load imbalances and fragmentation. The first step in designing and managing memory for an MPSoC is to analyse the memory requirements of the target application. This process has been automated by many authors, such as [11]. However, to the best of our knowledge, there has been no automated analysis of the Viola Jones algorithm's memory requirements. Furthermore, we are interested in the merits of such an analysis to overcome the difficulties MPSoCs introduce with memory design and management. Therefore, a custom analysis of the Viola Jones algorithm's memory requirements is provided. Viola Jones searches an input image at multiple scales in order to detect objects from foreground to background. This is achieved by converting the input image into multiple scaled images at a specified reduction factor. Each of these image allocations represents the search space of the algorithm and each is searched accordingly. Using the 80% reduction factor specified by [19] for VGA input images (640 × 480) yields 15 such allocations that must be processed. If we were to use the standard method of memory management for the processors of an MPSoC (Fig. 9(a)), each processor would be assigned an allocation and process it, which could lead to load imbalances and sub-optimal memory utilisation and interaction.
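The scale count quoted above can be reproduced with a short sketch that also pre-assigns a static offset to each allocation, in the spirit of Fig. 9(b). The base address and the one-byte-per-pixel greyscale assumption are illustrative, not from the paper:

```python
def allocation_table(width, height, base=0x90000000, factor=0.8, window=20):
    """Statically pre-assign an offset for each scaled-image allocation."""
    table, offset = [], base
    w, h = width, height
    while w >= window and h >= window:      # the search window must still fit
        table.append((w, h, offset))
        offset += w * h                     # assume 1 byte per greyscale pixel
        w, h = int(w * factor), int(h * factor)
    return table
```

For a 640 × 480 input at an 80% reduction factor this yields the 15 allocations stated in the text; compiling the resulting addresses into every processor's application code is what unifies the address space.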
To address this issue, allocations are statically assigned to memory and accessed through pre-assigned addresses, as shown in Fig. 9(b). It can be seen how the ownership of the allocations has changed such that each processor now knows of all allocations within the system. This in turn reduces memory fragmentation and any chance of processors corrupting memory. The address of each allocation is stored within application code to allow each processor access to it, thus unifying the address space of the MPSoC. To ensure the allocations are thread-safe, i.e. that accesses to allocations are atomic where required, a dependent multithreaded programming model is presented that facilitates the best-case loading of computation on each processor of the MPSoC.

6.2. Multithreaded programming

Through the design of main memory to accommodate the data buffers of an application, the last consideration is processor
interactions with these buffers. Image processing applications are repetitive in nature and process image data using scan patterns. Fig. 10 illustrates an example scan pattern for object detection, where the active region represents the search window. To implement multithreaded interaction with these allocations, we must ensure thread interactions do not interfere with each other. Assume an allocation is to be processed by n threads; the allocation is partitioned into n slices, one for each thread. To ensure threads operate on their own slice exclusively, we must ensure the active region of each thread does not converge. This is achieved by ensuring each active region is separated by a safe distance, where a unit of distance is defined as a row or column of the allocation. The distance is calculated based on the size of the active region, which for image processing applications is the computational kernel: e.g. a filtering kernel or search window. For Viola Jones, the kernel is the search window, which is a constant 20 × 20 units in size. If an allocation is to be processed by n threads, each active region must be separated by at least one unit of distance. If this cannot be satisfied, the allocation is processed by the number of threads that do satisfy this rule. Fig. 11(a) illustrates this for an allocation being processed by two threads, where the active regions overlap, are not separated by at least one row, and are therefore not safe. In Fig. 11(b) the allocation is partitioned into two slices (one for each active region) and each thread's active region is therefore separated by a safe distance. The partitioning of allocations across the available processors of the system allows loads to be balanced according to the volume of data to be processed, whereas the standard memory management technique would only allow loads to be assigned based on the number of allocations in the system. Each thread of
the system is therefore characterised by the volume of data it must process, allowing each processor of the MPSoC to be given a best-case computational load of the application. However, only equal portions of data can be allocated to processors: the actual computational load depends on the complexity of the data within each slice and cannot be known before runtime.
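The slicing rule above can be sketched as follows (illustrative names; the safe distance is approximated here by requiring every slice to be at least one kernel of rows tall, so neighbouring active regions cannot converge):

```python
def slice_allocation(rows, n_threads, kernel=20):
    """Partition an allocation's rows into one slice per thread, keeping
    neighbouring active regions a safe distance apart by requiring each
    slice to hold at least one full kernel of rows."""
    while n_threads > 1 and rows // n_threads < kernel:
        n_threads -= 1        # fall back to fewer threads (the paper's rule)
    step = rows // n_threads
    slices = [(i * step, (i + 1) * step) for i in range(n_threads)]
    slices[-1] = (slices[-1][0], rows)   # last slice absorbs the remainder
    return slices
```

A 480-row VGA allocation splits cleanly into eight 60-row slices, while a 30-row scale near the top of the pyramid falls back to a single thread.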
7. Experimental results

This section presents the results of extensive testing of the proposed on- and off-chip memory customisations with image processing algorithms on MPSoCs. First, the employed testing methodology and test images are discussed.

7.1. Test methodology

VGA test images (640 × 480) are used to provide a comprehensive and robust evaluation of the memory customisations due to their large search space and wide use in modern imaging systems. Images from the CMU faces dataset [18] are normalised to VGA dimensions and used to test standard on- and off-chip memory designs and the proposed optimised IC/DC, data reuse, custom memory management, and multithreading model. Test images can be found in Fig. 12. Results are presented as performance increases, processing loads, and resource and power consumption. The execution time of each processor is obtained via onboard timer peripherals, and processing loads are unchanged for all test images. The MPSoCs implemented and tested resemble SMPs. Processors of an MPSoC are Microblaze soft-core processors in the performance-optimised configuration, interfaced with shared off-chip memory using the AXI4 bus and with peripherals using the AXI-lite bus. All processors, on-chip memories, and the AXI4 bus are clocked at 100 MHz. Off-chip memory is clocked at 200 MHz and the AXI-lite bus is clocked at 50 MHz.
Fig. 11. Implementing thread safety for allocations: unsafe thread interactions with an allocation (a) and safe interaction by allocation slicing (b).

7.2. Instruction and Data Cache
The performance increase trends resulting from the instruction and data cache customisations are presented in Fig. 13 for up to 8 processors. It can be seen that the proposed customisations outperform standard designs for all processor counts. IC customisations outperform 4–8-processor standard designs by up to a constant 1.03×, indicating that the customisation will continue to bring this performance increase to MPSoCs with larger processor counts. However, the largest performance increase is delivered by the DC customisations, with performance increases of up to 1.71× for 8-processor designs. This increase in performance can be attributed to the data reuse implemented by the proposed cost
Fig. 12. Sample test images: (a) simple, (b) average, and (c) complex [18].
Fig. 13. Normalised instruction and data cache optimisations results.

Fig. 15. Performance increase trends of proposed customisations compared to standard designs for Viola Jones.
Table 1
On-chip resource consumption of all designs.

Processors    [20] (%)        Proposed (%)
1             11.12           1.71
2             22.24           3.42
4             44.48           6.84
8             Not possible    13.68
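The scalability argument behind Table 1 follows from linear scaling of the per-processor figures quoted in the text (11.12% per processor for [20], 3.42% for standard designs, 1.71% for the proposed customisations). A small sketch that cross-checks the table's arithmetic:

```python
# Cross-check of Table 1: total on-chip resource consumption scales linearly
# with the per-processor figures quoted in the text.
PER_PROCESSOR = {"[20]": 11.12, "standard": 3.42, "proposed": 1.71}  # percent

def design_cost(variant, n_processors):
    """Total on-chip resource consumption (%) of an n-processor design."""
    return round(PER_PROCESSOR[variant] * n_processors, 2)

assert design_cost("[20]", 4) == 44.48      # matches Table 1
assert design_cost("proposed", 8) == 13.68  # matches Table 1
# An 8-processor proposed design costs the same on-chip area as a
# 4-processor standard design, as noted in the text.
assert design_cost("proposed", 8) == design_cost("standard", 4)
```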
function. By implementing the optimal data reuse for the Viola Jones case study, DC misses are reduced and data reuse of the detection library is maximised for each processor, resulting in scalable performance increases for MPSoC designs. We can therefore conclude that the proposed cache customisations provide reliable performance improvements compared to standard cache arrangements. Furthermore, the complexity and time required to implement these changes is very low, which further favours their adoption.

7.3. Memory management and multithreading

Fig. 14. Processor loads for proposed and standard memory management techniques.

Fig. 14 presents normalised processor loads for 8-processor MPSoC designs using the proposed and standard memory management and multithreading. The standard deviation of processor loads for the standard technique was 1.4383 s, whereas the proposed technique achieved a standard deviation of 0.1378 s. This is a direct consequence of the MPSoC development environment used in this paper and the collaborative benefits possible through the proposed multithreading techniques. By implementing custom memory management, the address space of the MPSoC becomes global, allowing processors to access all data allocations of the application. The proposed multithreading programming model allows allocations to be processed equally by all processors of the system, better balancing processor loads and reducing the worst-case execution time (WCET) of the applications.

7.4. Performance increases

Performance increases of MPSoCs using all proposed customisations compared to standard MPSoCs are presented in Fig. 15 for the three categories of test images shown in Fig. 12. Images containing a small number of faces are computationally simple, whereas images with large numbers of faces are computationally complex. As can be seen, the proposed customisations far outperform standard MPSoCs for all designs and test images. For computationally simple images (e.g. Fig. 12(a)), performance increases of 1.53×–2.93× were measured. For complex images (e.g. Fig. 12(c)), performance increases ranged from 1.22× to 2.34× for 2–8-processor MPSoCs. Images with an average of four faces (e.g. Fig. 12(b)) achieved performance increases of 1.43×–2.75×. The instruction cache customisations reduce traffic to main memory and allow processors to operate with a constant instruction-fetch latency. The proposed cost function implemented optimal data reuse for the system: reducing accesses to main memory, reducing contention for resources, and reducing traffic and contention on shared buses by keeping accesses local. Lastly, the multithreaded programming model balances processing loads across all processors of the system.

7.5. Resource usage

Previous implementations [20] used 11.12% of on-chip resources per processor, whereas the proposed designs use only 1.71%;
standard designs use on-chip resources of 3.42% per processor. These resource consumptions have a massive impact on the scalability of MPSoCs, as can be seen from Table 1. Reducing the on-chip resource consumption of each processor by 9.41% increases the scalability of MPSoCs and allows the implementation of larger designs by consuming fewer resources than [20]. For 8-processor designs, the proposed customisations consumed as much on-chip resources as the standard 4-processor MPSoC and fractionally more than the 1-processor MPSoC of [20]. It can also be seen that the designs of [20] could not be realised on the FPGA due to their high levels of SPM usage. The proposed on-chip memory customisations therefore improve the performance and scalability of MPSoC designs executing image processing applications.

7.6. Power consumption

Table 2
Dynamic power consumption comparison.

Processors    [20] (W)        Proposed solutions (W)
1             0.702           0.443
2             0.929           0.525
4             1.394           0.627
8             Not possible    0.923

Power consumption details are obtained from the Xilinx Power Analyser [24]. A summary of the dynamic power consumption of [20] and the proposed memory customisations is presented in Table 2. For reference, a standard Microblaze processor with 32 KB IC/DC and a 3-stage pipeline has a dynamic power consumption of 0.095 W. Power consumption results are not available for [4,14,15]. The proposed solutions achieved lower static power consumption for all designs on account of the smaller amounts of on-chip memory used. From Table 2 it can be seen that the proposed IC and DC customisations lead to a decrease in dynamic power consumption of up to 1.115 W, due to the smaller number of on-chip memories used for each processor. Correlating the performance increases and power consumptions of [20] and the proposed customisations, it can be concluded that the proposed MPSoC architectures improve performance whilst reducing the power consumption of MPSoCs, making them ideal for embedded multiprocessor systems.

7.7. Literature comparison

Table 3
Frame-rate comparisons for proposed solutions and the literature.

            Architecture    Processors    Normalised frame-rate (f/s)
[4]         ARM V5          2             0.134
[4]         ARM V5          4             0.268
[14,15]     ARM V5          2             0.096
[14,15]     ARM V5          4             0.18
Proposed    Microblaze      2             0.23
Proposed    Microblaze      4             0.42

Fig. 16. Performance increase trends of proposed and standard designs with image processing benchmarks.
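The frame-rates of Table 3 are normalised by processor clock speed so that on-chip memory speeds are comparable across platforms. A plausible sketch of such a normalisation; the scaling rule, reference clock, and example figures are assumptions made for illustration, not taken from the paper:

```python
# Hypothetical clock-speed normalisation for Table 3-style comparisons:
# scale each reported frame-rate to a common reference clock. The 100 MHz
# reference (the Microblaze clock used here) and the example literature
# clock below are assumptions for illustration.
REFERENCE_CLOCK_MHZ = 100.0

def normalised_frame_rate(fps, clock_mhz):
    """Scale a measured frame-rate linearly to the reference clock frequency."""
    return fps * (REFERENCE_CLOCK_MHZ / clock_mhz)

# e.g. an invented design reporting 1.2 f/s at 400 MHz would normalise to 0.3 f/s
print(normalised_frame_rate(1.2, 400.0))
```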
The proposed memory customisations are compared to the literature based on the normalised frame-rates obtained for 2–4-processor MPSoCs and the same test image (Fig. 12(a)). Table 3 summarises the results. It should be noted that the proposed customisations are implemented on homogeneous MPSoCs, which are readily comparable to the SMPs presented in [8,9]. Literature frame-rates are normalised based on processor clock speed to normalise the speed of on-chip memories. It can be seen that the proposed customisations surpass the frame-rates of [4,14,15] for dual- and quad-core designs respectively, with improvements of up to 2.4×, highlighting the performance benefits of the proposed customisations. The literature makes use of general-purpose ARM multicore processors, where optimisations are performed at the software level. However, the proposed memory customisations target the architecture of MPSoCs directly and customise designs based on application characterisation. Therefore, it can be postulated that the performance of multicore processors could be improved through the adoption of the proposed memory customisations. Furthermore, the proposed customisations can be easily implemented by any competent software or systems engineer.

7.8. Image processing performance
Fig. 17. Extrapolated performance of proposed memory customisations.
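As a reference for the benchmarks compared in this subsection, minimal 3×3 versions of the kernels can be sketched as follows. These are textbook definitions, not the paper's implementations; they illustrate why Dilate and Erode perform identical work (only the reduction differs) and why the Sobel kernel's zero coefficients make it computationally cheaper:

```python
# Illustrative 3x3 kernels for the Sobel, Dilate, and Erode benchmarks.

SOBEL_GX = [[-1, 0, 1],
            [-2, 0, 2],
            [-1, 0, 1]]   # zero column -> a third of the multiplies vanish

def apply_at(img, y, x, op):
    """Apply a 3x3 neighbourhood reduction at interior pixel (y, x)."""
    window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return op(window)

def dilate_at(img, y, x):
    return apply_at(img, y, x, max)   # Dilate: neighbourhood maximum

def erode_at(img, y, x):
    return apply_at(img, y, x, min)   # Erode: neighbourhood minimum

def sobel_gx_at(img, y, x):
    """Horizontal Sobel response; memory accesses match Dilate/Erode."""
    return sum(SOBEL_GX[dy + 1][dx + 1] * img[y + dy][x + dx]
               for dy in (-1, 0, 1) for dx in (-1, 0, 1))

img = [[0, 0, 0, 0],
       [0, 9, 1, 0],
       [0, 2, 3, 0],
       [0, 0, 0, 0]]
print(dilate_at(img, 1, 1), erode_at(img, 1, 1), sobel_gx_at(img, 1, 1))
```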
Fig. 16 presents the results of testing the proposed customisations with well-known image processing applications: Sobel, Dilate, and Erode filtering processes, together with the standard performance of the applications. As can be seen, the proposed customisations far outperform the performance of standard
architectures for all tested image processing applications. Note that the results are presented as the performance of the system on a processor-to-processor basis. The Dilate and Erode benchmarks yield the same performance metrics on account of their operations differing only in the type of kernel used: all computation and memory accesses remain the same. The Sobel edge filtering application performs better than Dilate and Erode as its kernel contains zero constants that are computationally simpler. The large performance increase for 4 processors using the Sobel benchmark is a result of the simpler computational kernel and the local storage of the kernel improving processor performance. These results therefore demonstrate the benefit of the proposed memory customisations. Lastly, Fig. 17 presents the extrapolated performance of the proposed memory customisations and the theoretical multiprocessor performance. As can be seen, the proposed customisations follow a similar trend to the theoretical performance increase of multiprocessor designs. The validation of this performance trend is subject to future work, where the impact of bus arbitration and additional main memory users on the proposed memory customisations will be investigated.

8. Conclusions

This paper presents memory customisation and management techniques for image processing applications executing on Multiprocessor System on Chips (MPSoCs). Through on-chip resource optimisation targeting instruction and data caches, and off-chip management and multithreading techniques, performance increases of up to 2.93× were obtained. On-chip customisations led to resource consumption decreases of up to 9.41% per processor, increasing the scalability of MPSoC designs. Optimised core loading improved system performance through collaborative processing of input data.
Based on these results, it is believed that adopting the proposed memory customisations in general-purpose multicore processors may lead to similar performance increases and power/resource consumption reductions for multithreaded image processing applications.

References

[1] R. Ach, N. Luth, A. Techmer, Real-time detection of traffic signs on a multi-core processor, in: 2008 IEEE Intelligent Vehicles Symposium, June 2008, pp. 307–312.
[2] S.-K. Chen, T.-J. Lin, C.-W. Liu, Parallel object detection on multicore platforms, in: IEEE Workshop on Signal Processing Systems (SiPS 2009), October 2009, pp. 75–80.
[3] T. Chen, D. Budnikov, C. Hughes, Y.-K. Chen, Computer vision on multi-core processors: articulated body tracking, in: 2007 IEEE International Conference on Multimedia and Expo, July 2007, pp. 1862–1865.
[4] C.-H. Chiang, C.-H. Kao, G.-R. Li, B.-C. Lai, Multi-level parallelism analysis of face detection on a shared memory multi-core system, in: 2011 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), April 2011, pp. 1–4.
[5] D. Cho, S. Pasricha, I. Issenin, N. Dutt, M. Ahn, Y. Paek, Adaptive scratch pad memory management for dynamic behavior of multimedia applications, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 28 (4) (2009) 554–567.
[6] T. Deepak Shekhar, K. Varaganti, Parallelisation of face detection engine, in: 2010 39th International Conference on Parallel Processing Workshops (ICPPW), September 2010, pp. 113–117.
[7] D. Hefenbrock, J. Oberg, N. Thanh, R. Kastner, S. Baden, Accelerating Viola-Jones face detection to FPGA-level using GPUs, in: 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2010, pp. 11–18.
[8] P. Huerta, J. Castillo, C. Sanchez, J. Martinez, Operating system for symmetric multiprocessors on FPGA, in: International Conference on Reconfigurable Computing and FPGAs (ReConFig '08), 2008, pp. 157–162.
[9] A. Hung, W. Bishop, A. Kennings, Symmetric multiprocessing on programmable chips made easy, in: Proceedings of Design, Automation and Test in Europe (DATE 2005), vol. 1, 2005, pp. 240–245.
[10] Y. Iosifidis, A. Mallik, S. Mamagkakis, E. De Greef, A. Bartzas, D. Soudris, F. Catthoor, A framework for automatic parallelisation, static and dynamic memory optimisation in MPSoC platforms, in: 2010 47th ACM/IEEE Design Automation Conference (DAC), 2010, pp. 549–554.
[12] I. Issenin, E. Brockmeyer, B. Durinck, N. Dutt, Data-reuse-driven energy-aware cosynthesis of scratch pad memory and hierarchical bus-based communication architecture for multiprocessor streaming applications, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 27 (8) (2008) 1439–1452.
[13] I. Issenin, E. Brockmeyer, M. Miranda, N. Dutt, DRDU: a data reuse analysis technique for efficient scratch-pad memory management, ACM Trans. Des. Autom. Electron. Syst. 12 (2) (2007).
[14] B.-C. Lai, C.-H. Chiang, G.-R. Li, Classifier grouping to enhance data locality for a multi-threaded object detection algorithm, in: 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), December 2011, pp. 268–275.
[15] B.-C. Lai, C.-H. Chiang, G.-R. Li, Data locality optimisation for a parallel object detection on embedded multi-core systems, in: 2011 IEEE 2nd International Conference on Software Engineering and Service Science (ICSESS), July 2011, pp. 576–579.
[16] OpenCV, Open Computer Vision, May 2014.
[17] A. Ranjan, S. Malik, Parallelizing a face detection and tracking system for multi-core processors, in: 2012 Ninth Conference on Computer and Robot Vision (CRV), May 2012, pp. 290–297.
[18] Carnegie Mellon University, Frontal Face Images, April 2012.
[19] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, 2001, pp. I-511–518.
[20] D. Watson, A. Ahmadinia, G. Morison, T. Buggy, Custom memory architecture for multi-core implementation of face detection algorithm, in: Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI (GLSVLSI '13), ACM, Paris, 2013, pp. 125–130.
[21] W. Wolf, Multiprocessor system-on-chip technology, IEEE Signal Process. Mag. 26 (6) (2009) 50–54.
[22] W. Wolf, A. Jerraya, G. Martin, Multiprocessor system-on-chip (MPSoC) technology, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 27 (10) (2008) 1701–1713.
[23] S. Wuytack, J.-P. Diguet, F. Catthoor, H. de Man, Formalised methodology for data reuse: exploration for low-power hierarchical memory mappings, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 6 (4) (1998) 529–537.
[24] Xilinx, Xilinx Power Analyser, May 2014.
[25] Xilinx, Virtex-6 FPGA ML605 Evaluation Kit, September 2013.
David Watson received his B.Eng. (Hons) degree in Computer & Electronic Systems from the University of Strathclyde in 2009, and his M.Sc. degree in Wireless Communications Technologies from Glasgow Caledonian University in 2010. Since then he has been a Ph.D. student at Glasgow Caledonian University, working on object detection techniques with custom memory in multi-core architectures.
Ali Ahmadinia received his Ph.D. degree from the University of Erlangen-Nuremberg, Germany, in 2006. In 2004–2005, he worked as a research associate in the Electronic Imaging group of the Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany. In 2006–2008, he was a research fellow in the School of Engineering and Electronics, University of Edinburgh, Edinburgh, UK. In 2008, he joined Glasgow Caledonian University, Glasgow, UK, where he is now a senior lecturer in embedded systems. His research has resulted in more than 80 international journal and conference publications in the areas of reconfigurable computing, system-on-chip design, and wireless and DSP applications.