Journal of Parallel and Distributed Computing 135 (2020) 83–100

SIMD programming using Intel vector extensions

Hossein Amiri, Asadollah Shahbahrami



Department of Computer Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran

Article history: Received 3 November 2018; Received in revised form 19 August 2019; Accepted 18 September 2019; Available online 28 September 2019

Keywords: Intel SIMD, AVX, AVX-512, Vectorization

Abstract

Single instruction multiple data (SIMD) extensions are one of the most significant capabilities of recent General Purpose Processors (GPPs), improving the performance of applications with little hardware modification. Each GPP vendor, such as HP, Sun, Intel, and AMD, has its particular Instruction Set Architecture (ISA) and SIMD micro-architecture with different perspectives. Intel has expanded SIMD technologies from both the hardware and the software point of view. It has introduced SIMD technologies such as MultiMedia eXtensions (MMX), Streaming SIMD Extensions (SSE), Advanced Vector eXtensions (AVX), Fused Multiply Add (FMA) and the AVX-512 sets. Along the micro-processor development path, the register width has been extended from 64 bits to 512 bits and the number of vector registers has been increased from 8 to 32. Wider registers provide more ways of parallelism, and more registers reduce extra data movement to the cache memory. In order to gain the advantages of SIMD extensions, many programming approaches have been developed. Compiler Automatic Vectorization (CAV), as an implicit vectorization approach, provides simple and easy SIMDization tools; however, the performance improvement of CAV is not always guaranteed, and most compilers auto-vectorize only simple loops. On the other hand, for explicit vectorization, the Intrinsic Programming Model (IPM) provides low-level access to vector registers for SIMDizing. However, programming with IPM requires a great amount of expertise, especially in low-level architecture features; thus, choosing the suitable instructions and vectorization methodology for mapping a certain algorithm is important. Moreover, portability, compatibility, scalability and compiler optimization might limit the advantage of IPM. Our goals in this paper are as follows. First, we provide a review of SIMD technology in general and Intel's SIMD extensions in particular. Second, some SIMD features of the Intel SIMD technologies MMX, SSEs, AVX, and FMA are comparatively discussed in terms of ISA, vector width, and SIMD programming tools. Third, in order to compare the performance of different auto-vectorizers and IPM approaches using the Intel C++ compiler (ICC), GNU Compiler Collection (GCC) and Low Level Virtual Machine (LLVM), we map and implement some representative multimedia kernels on the AVX and AVX2 extensions. Finally, our experimental results show that although the performance improvement using the IPM approach is higher than with CAVs, the programmer needs more programming effort and must know different mapping strategies. Therefore, extending the auto-vectorizers' abilities to generate more efficient vectorized code is an important issue in different compilers. © 2019 Elsevier Inc. All rights reserved.

1. Introduction

Single Instruction Multiple Data (SIMD) capability is one of the most significant aspects of General Purpose Processors (GPPs) [12]. Intel has been introducing SIMD extensions in its microprocessors since the late twentieth century [88]. MultiMedia eXtension (MMX), using 64-bit registers, was introduced in 1997 [73]; after that, Streaming SIMD Extensions such as SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2, using 128-bit registers, were released [25]. The Advanced Vector eXtensions (AVX) and AVX2 technologies were then introduced, using 256-bit registers to support floating-point and fixed-point operations, respectively [27]. In addition to

AVX2, Fused Multiply Add (FMA) has been released to perform floating-point multiply-addition and multiply-subtraction. The FMA reduces the throughput of instructions compared to the corresponding combined operations using AVX [33]. Finally, in 2015 Intel introduced a new SIMD technology to exploit wider registers, named AVX-512, which contains many subsets such as AVX-512F, AVX-512CD, AVX-512PF, AVX-512ER, AVX-512BW, AVX-512DQ, and AVX-512VL. Each SIMD technology has its own Instruction Set Architecture (ISA), which provides different arithmetic, logical, data-transfer, rearrangement and special-purpose instructions that can be used to improve the performance of many applications [34]. Intel's SSEs provide more than 300 instructions to exploit SIMD capability. These instructions have been developed and


Fig. 1. A simple operation; (a) Scalar model performs an operation to produce a single result; (b) SIMD model performs the operation on m elements simultaneously.

expanded over the years; in each technology, new capabilities have been added [25]. AVX improves vector instructions for single-precision and double-precision floating-point operations. It adds more than 80 new instructions and provides new operations such as permute and broadcast. Intel AVX employs the VEX encoding scheme, which is designed to address compatibility problems and future maintenance. When VEX-encoded instructions are mixed with non-VEX (legacy SSE) instructions, transition penalties must be addressed [35]. AVX2 adds more than 130 instructions to support SIMD instructions for integers. It provides enhanced functionality for broadcast and permute operations on data elements, vector shift instructions with a variable shift count per data element, and gather instructions to fetch non-contiguous data elements from memory [27]. Gather instructions were first released in the Haswell micro-architecture and suffered from high latency, which has been improved in later generations such as the Skylake micro-architecture [36]. AVX-512 introduces many new instructions such as scatter, compress, expand, conflict, classifier, reduce, and ternary logic. It has also enhanced the VEX encoding scheme and introduced the new EVEX encoding to cover all new features.

In this work, we illustrate SIMD technology from the architectural, parallel programming, and mapping strategy points of view. Our contributions are as follows. First, a summary of the primary SIMD extensions of different processor companies is provided, and Intel's SIMD technologies up to AVX-512 are summarized and compared. Second, different SIMDization approaches, explicit and implicit SIMD programming using the Intel C++ compiler (ICC), GNU Compiler Collection (GCC), Low Level Virtual Machine (LLVM) and the Intrinsic Programming Model (IPM), are explained and compared in terms of performance improvement for some multimedia kernels. Our experimental results show that all compilers have almost the same behavior for the IPM approach, while the auto-vectorizer of each compiler behaves differently for each kernel.

This paper is organized as follows. Background information about SIMD is presented in Section 2 and Intel SIMD extensions are explained in Section 3. Section 4 describes some of the Intel SIMD instruction set architectures. Section 5 deals with SIMD programming models, tools and documents. In order to clarify how to map algorithms to SIMD architectures, Section 6 discusses the mapping of some multimedia kernels on SIMD extensions using the IPM approach. Performance evaluation and experimental results of both explicit and implicit vectorization using different compilers are presented in Section 7. Finally, the paper ends with conclusions in Section 8.

2. Single Instruction Multiple Data (SIMD)

2.1. Basic principles of SIMD technologies

Nowadays single instruction multiple data is counted as a ubiquitous technology that can improve the performance of many

Fig. 2. (a) Single Instruction Single Data (SISD) versus (b) Single Instruction Multiple Data (SIMD) architecture.

engineering and multimedia applications significantly [22]. As its name indicates, instead of performing a single instruction on every single data element, it provides the capability of using a wider data width for similar computational operations [83]. There are various SIMD technologies such as vector architectures, multimedia SIMD instruction set extensions, and Graphics Processing Units (GPUs) [23]. A simple scalar operation is depicted in Fig. 1(a), in contrast to a SIMD computation which is depicted in Fig. 1(b). It shows how the op operation is performed on m data elements simultaneously, computing the results of m different elements. Single instruction multiple data is a type of computation that performs an operation on packed data; its instructions are available in each SIMD technology extension [34]. The packed data are stored in vector registers or read from memory using memory instructions. For a particular operation, a vector–vector instruction generates fewer micro-operations than memory–vector or vector–memory instructions [3]. This kind of computation is very common in multimedia processing, which needs a single instruction to be performed on a huge amount of data. One of the most beneficial points of this technology is lowering the overheads on hardware and the bottom layers of software [34]. Besides, multi-threading as a popular parallelization approach can be used to perform SIMD instructions on multi-cores and many-cores [18,60]; it has even been used in distributed systems and virtual machines [96]. In addition, it reduces the total power consumption [63].

2.2. SIMD architectures

SIMD architectures generally provide two kinds of SIMD instructions. The first are the SIMD computational instructions such as arithmetic instructions. The second are the SIMD overhead instructions that are necessary for data movement, data type conversions, and data reorganization [98]. The latter instructions are needed to bring data into a form amenable to SIMD processing. These instructions constitute a large part of SIMD codes. Fig. 2 depicts the differences between Single Instruction Single Data (SISD) and SIMD. As it shows, vector registers containing multiple data elements participate in the SIMD operation, while a single element in a common register is used for SISD operations. A simple comparison between SIMD and scalar architectures is depicted in Table 1 from the ISA point of view. As can be seen,

Fig. 3. An n-bit partitioned ALU that is divided into n/k parallel functional units using the subword level parallelism concept.

Table 1: SIMD operations vs. scalar operations.

| Type of instruction | SIMD | Scalar |
|---|---|---|
| Arithmetic | Yes | Yes |
| Data transfer | Yes | Yes |
| Data type conversion | Yes | Yes |
| Data reorganization | Yes | No |
| Special purpose instruction | Yes | No |

instructions such as data reorganization and Special Purpose Instructions (SPIs) are only available as SIMD instructions. These instructions are needed to make SIMD instructions practical [91]. For example, SPIs gain significant improvements compared to equivalent sequences of general SIMD operations [14,87]. In some applications, data are not ordered in the correct form; thus, SIMD reorganizing instructions are needed to implement the SIMD version of the application, although they incur overheads on the implementation. For instance, the SIMD implementations of the MPEG/JPEG codecs using the VIS ISA require on average 41% overhead instructions such as packing/unpacking and data re-shuffling. The execution of this large number of SIMD overhead instructions decreases the performance and increases pressure on the fetch and decode steps [80]. In sub-word level parallelism [65], as depicted in Fig. 3, multiple subwords are packed into a word, and then the whole word is processed. This concept is used in order to exploit Data Level Parallelism (DLP) with existing hardware without sacrificing the general-purpose nature of the processor. In sub-word level parallelism, a register is viewed as a small vector with elements that are smaller than the register size. It also provides a very low-cost form of SIMD parallelism. Besides, SIMDization uses the functional units and the memory ports without significant additional cost [83]. In addition, due to high speedups, using SIMD instructions can reduce the total power consumption [63], even though most SIMD instructions consume more power than the corresponding scalar instructions.
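To make the scalar versus SIMD contrast of Figs. 1–3 concrete, the following minimal C sketch (not from the paper; array names and sizes are illustrative) adds two float arrays once with a scalar loop and once with AVX intrinsics, where each 256-bit ymm register holds eight 32-bit lanes.

```c
#include <immintrin.h>

#define N 1024  /* assumed to be a multiple of 8 for brevity */

/* Scalar (SISD): one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* SIMD (AVX): eight additions per instruction using 256-bit ymm registers. */
void add_avx(const float *a, const float *b, float *c) {
    for (int i = 0; i < N; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* load 8 packed floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);    /* one instruction, 8 results */
        _mm256_storeu_ps(c + i, vc);          /* store 8 packed floats */
    }
}
```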

2.3. Historical summary

The first multimedia extensions were Intel's MMX [73], Sun's Visual Instruction Set (VIS) [94], Compaq's Motion Video Instructions (MVI) [13], the MIPS Digital Media eXtension (MDMX) [57], which was never implemented, and HP's Multimedia Acceleration eXtension (MAX) [65]. These extensions supported only integer data types and were introduced in the mid-1990s. 3DNow [8] was the first effort by Advanced Micro Devices (AMD), Inc. to support floating-point instructions. It was followed by the Streaming SIMD Extension (SSE) and SSE2 from Intel [78]. Motorola's AltiVec [17] supported integer as well as floating-point media instructions. In addition, high-performance processors also used SIMD processing. An excellent example of this was the Cell processor [24], developed by a partnership of IBM, Sony, and Toshiba. Cell was a heterogeneous chip multiprocessor consisting of a PowerPC core that controls eight high-performance Synergistic Processing Elements (SPEs). Each SPE had one SIMD computation unit, referred to as the Synergistic Processor Unit (SPU). Each SPU had 128 × 128-bit registers. SPUs supported both integer and floating-point SIMD instructions. Table 2 summarizes the common and distinguishing features of the primary multimedia instruction set extensions of different processor companies [83].

3. Intel SIMD extensions

Table 3 depicts particular characteristics of the SIMD extensions available in Intel GPPs from 1997 to 2016. As the table shows, the SIMD register width has been extended from 64 bits to 512 bits, from the MMX technology to the AVX technologies.

3.1. Multimedia extensions

Intel SIMD extensions started with the MultiMedia eXtensions (MMX) technology in 1997, which provided 57 instructions for packed integers on eight 64-bit registers borrowed from the floating-point registers [73,83]. It used the sub-word level parallelism concept on 64-bit registers; in other words, a 64-bit ALU is partitioned into four parallel functional units for 16-bit sub-words. MMX had major defects: the 80-bit floating-point unit was disabled while MMX was in use, useful instructions such as max and min were not supported, and vector operations for floating-point data were not provided. However, the attention that many researchers and programmers paid to MMX assured Intel it should invest in SIMD capability and expand it to the 128-bit separate vector registers of the SSE technologies [58,89].

3.2. Streaming SIMD extensions

In 1999, the Streaming SIMD Extensions (SSE) added 70 new instructions, used eight 128-bit separate xmm registers and provided packed single-precision floating-point operations. A year later, SSE2 added 144 new instructions for integer and double-precision floating-point operations. Then SSE3 introduced 13 new instructions, followed by Supplemental SSE3 (SSSE3) with 32 new instructions which accelerate several types of computations on


Table 2: Summary of primary multimedia extensions. Sk and Uk indicate k-bit signed and unsigned integer packed elements, respectively. Values k without a prefix U or S in the last rows indicate operations that work for both signed and unsigned values.

| ISA name | MAX-1/2 | VIS | MDMX | MMX/SIMD | MMX/3DNow | SSE | SSE2 | AltiVec/VMX | SPU ISA |
|---|---|---|---|---|---|---|---|---|---|
| Company | HP | Sun | MIPS | Intel | AMD | Intel | Intel | Motorola/IBM | IBM/Sony/Toshiba |
| Instruction set | PA-RISC 2 | P. V.9 | MIPS-V | IA32 | IA32 | IA64 | IA64 | PowerPC | – |
| Processor | PA-8000 | UltraSPARC | – | P2 | K6-2 | P3 | P4 | MPC7400 | Cell |
| Year | 1995 | 1995 | 1997 | 1997 | 1999 | 1999 | 2000 | 1999/2002 | 2005 |
| Datapath width | 64-bit | 64-bit | 64-bit | 64-bit | 64-bit | 128-bit | 128-bit | 128-bit | 128-bit |
| Size of register file | (31)/32 × 64b | 32 × 64b | 32 × 64b | 8 × 64b | 8 × 64b | 8 × 128b | 8 × 128b | 32 × 128b | 128 × 128b |
| Dedicated or shared with | Int. Reg. | FP Reg. | FP Reg. | FP Reg. | Dedicated | Dedicated | Dedicated | Dedicated | Dedicated |
| Integer 8-bit elements | – | 8 | 8 | 8 | 8 | 8 | 16 | 16 | 16 |
| Integer 16-bit elements | 4 | 4 | 4 | 4 | 4 | 4 | 8 | 8 | 8 |
| Integer 32-bit elements | – | 2 | – | 2 | 2 | 2 | 4 | 4 | 4 |
| Integer 64-bit elements | – | – | – | – | – | – | 2 | – | 2 |
| Shift right/left | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Multiply-add | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Shift-add | Yes | No | No | No | No | No | No | No | No |
| Floating-point | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Single-precision | – | – | 2 × 32 | – | 2 × 32 | 4 × 32 | 4 × 32 | 4 × 32 | 4 × 32 |
| Double-precision | – | – | – | – | – | – | 2 × 64 | – | 2 × 64 |
| Accumulator | No | No | 1 × 192b | No | No | No | No | No | – |
| # of instructions | 8 (9) | 121 | 74 | 57 | 24 | 70 | 144 (a) | 162 | 213 |
| # of operands | 3 | 3 | 3–4 | 2 | 2 | 2 | 2 | 3 | 2/3/4 |
| Sum of absolute differences | No | Yes | No | No | Yes | Yes | Yes | No | Yes |
| Modulo addition/subtraction | 16 | 16, 32 | 8, 16 | 8, 16, 32 | 8, 16, 32, 64 | 8, 16, 32, 64 | 8, 16, 32, 64 | 8, 16, 32 | 8, 16, 32, 64 |
| Saturation addition/subtraction | U16, S16 | No | S16 | U8, U16, S8, S16 | U8, U16, S8, S16 | U8, U16, S8, S16 | U8, U16, S8, S16 | U8, U16, U32, S8, S16, S32 | – |

(a) Note that 68 instructions of the 144 SSE2 instructions operate on 128-bit packed integers in XMM registers, wide versions of the 64-bit MMX/SSE integer instructions [84].

Table 3: Summary of available Intel multimedia extensions with GPPs [19,84,85].

| ISA name | MMX | SSE | SSE2 | SSE3 | SSSE3 | SSE4 | AVX | AVX2 | AVX3 |
|---|---|---|---|---|---|---|---|---|---|
| Date | 1997 | 1999 | 2000 | 2004 | 2006 | 2007 | 2011 | 2013 | 2016 |
| Register width | 64 bits | 128 bits | 128 bits | 128 bits | 128 bits | 128 bits | 256 bits | 256 bits | 512 bits |
| # of vector registers | 8 | 8 | 8/16 | 8/16 | 8/16 | 8/16 | 8/16 | 8/16 | 32 |
| Integers | Yes | No | Yes | No | Yes | Yes | No | Yes | Yes |
| 8-bit | 8 | – | 16 | – | 16 | 16 | – | 32 | 64 |
| 16-bit | 4 | – | 8 | – | 8 | 8 | – | 16 | 32 |
| 32-bit | 2 | – | 4 | – | 4 | 4 | – | 8 | 16 |
| 64-bit | – | – | 2 | – | – | 2 | – | 4 | 8 |
| Floating point | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes |
| Single precision | – | 4 × 32 | – | 4 × 32 | – | 4 × 32 | 8 × 32 | – | 16 × 32 |
| Double precision | – | – | 2 × 64 | 2 × 64 | – | 2 × 64 | 4 × 64 | – | 8 × 64 |

vectors, including horizontal addition and subtraction operations, absolute value operations and in-place shuffling according to a third control operand. The SSE4 set, including SSE4.1 and SSE4.2, completed the 128-bit instruction set of Intel platforms. SSE4.1 introduced 47 new instructions for compiler vectorization improvement and packed dword computation support. SSE4.2 added seven new instructions, mostly for string/text processing [79]. The SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2 extensions, which are named Intel SSEs in this paper, can theoretically be executed 128/k times faster than the corresponding sequential implementation, where k is the size of the data type in bits. Over the years, owing to technological progress, manufacturers' competition and the evolution of HPC, AVX was born within GPPs, exploiting sixteen 256-bit ymm registers, where each ymm register is logically viewed as two 128-bit lanes [97].

3.3. Advanced vector extensions

Advanced Vector Extensions (AVX) was introduced in the second generation of the Intel Core processor family [27]. It focuses on

floating-point vector performance for Digital Signal Processing (DSP), cryptography, scientific, engineering, numerical and many other applications. The primary purpose of AVX is improving performance through changeable degrees of thread-level parallelism and data vector lengths [21]. AVX has many similarities to SSE and to the double-precision floating-point portion of SSE2. It provides almost the same latency and throughput as the Intel SSEs extensions while the register size is doubled. However, for some operations, such as dividing two single-precision floating-point vectors, a vdivps may cost more latency and throughput than the corresponding SSE divps on the Skylake micro-architecture [37]. The syntax of AVX instructions is slightly different from the previous extensions: it generally provides a three-operand syntax instead of the traditional two-operand syntax. Therefore, it is possible to express c = a + b exactly, since one operand is the destination and the two others are source operands. For instance, a vaddps ymm0, ymm1, ymm2 instruction adds the packed single-precision (8 × 32-bit) floating-point elements in ymm1 and ymm2 and stores the result into ymm0. Thus it leaves the source operands unchanged; the destination is the first operand


in Intel assembly syntax [36] and the last operand in AT&T assembly syntax [72], both of which generate the same machine code. This feature is called a non-destructive source operand. Theoretically, AVX can demonstrate a speedup by a factor of two over Intel SSE and can be executed 8× faster than the corresponding sequential programs for single-precision floating-point vectors. However, the number of pipelining stages, the number and type of execution units, compiler optimization techniques, limitations of SIMD programming models, memory overheads, and the programmer's knowledge may impact the overall performance. AVX also has relaxed memory alignment requirements, which removes the penalty of using misaligned operands [22,38].

Fig. 4. AVX-512 registers in 64-bit processors.

3.4. Advanced vector extensions 2

The AVX2 instructions, a major step for integers, were introduced in 2013 with the Haswell micro-architecture [56]. AVX2 extends most Intel SSEs instructions to 256-bit vectors. The AVX2 instructions follow the same programming syntax as the AVX instructions. AVX2 also provides enhanced functionality for broadcast and permute operations, and adds gather instructions to fetch data from non-contiguous memory addresses [37]. In addition to AVX2, Fused Multiply Add (FMA) was released to perform floating-point multiply-addition and multiply-subtraction. FMA combines the multiplication and addition operations in such a way that a variant mixture of certain floating-point operations can be performed. Similar to AVX2, Intel's FMA was introduced in the Haswell micro-architecture and is designed for both vector and scalar operations [56]. In addition to fused addition, FMA can fuse a subtraction instead of the addition as well. Moreover, FMA provides a mixture of add and subtract operations to accumulate the multiplication results into the destination vector elements. For instance, the _mm256_fmaddsub_pd intrinsic function subtracts the results for the even vector elements and adds the results for the odd vector elements, accumulating them into the corresponding vector register elements of the destination [37].

3.5. Advanced vector extensions 3

The AVX-512, also known as AVX3, is the state-of-the-art Intel SIMD technology and was released in 2016 with the Xeon Phi micro-processors. AVX-512 is not Intel's first effort to utilize 512-bit vector registers; previously, Intel introduced Knights Corner (KNC) to support 512-bit SIMD operations [90]. The KNC is based on the Intel Many Integrated Core (MIC) architecture, which was introduced in 2012 [4,62,74,93]. In addition, many design elements of MIC were inherited from the Larrabee project [82], which was canceled in 2009. The registers of AVX-512 are depicted in Fig. 4. As can be seen, it provides 32 × 512-bit zmm registers, whose lower parts are available to the previous technologies as well. AVX-512 has multiple subsets of instructions such as Fundamental (AVX-512F), Conflict Detection (AVX-512CD), Exponential and Reciprocal (AVX-512ER), Prefetch (AVX-512PF), Byte and Word (AVX-512BW), Doubleword and Quadword (AVX-512DQ), Vector Length (AVX-512VL), Integer Fused Multiply Add (AVX-512IFMA52), and Vector Bit Manipulation Instructions (AVX-512VBMI) [39]. Table 4 depicts ten AVX-512 subsets that are available in some Intel processors, which are depicted in Fig. 5. As can be seen in this figure, Skylake-X and Xeon processors have five AVX-512 subsets, while Xeon Phi processors have four subsets; two of them, the AVX-512CD and AVX-512F subsets, are available in all these processors [40].

4. Intel SIMD instruction set architectures

In this section some new SIMD instructions are briefly presented.
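As a concrete illustration of the FMA instructions introduced alongside AVX2 (Section 3.4), the following hedged sketch shows the fused multiply-add and multiply-add/subtract intrinsics; the function names are illustrative and not part of the paper.

```c
#include <immintrin.h>

/* d = a*b + c on eight packed floats with one FMA instruction.
   Compile with FMA support, e.g. gcc -O2 -mfma. */
__m256 fma_add(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}

/* _mm256_fmaddsub_pd: even elements get a*b - c, odd elements get a*b + c. */
__m256d fma_addsub(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmaddsub_pd(a, b, c);
}
```

Compared with a separate _mm256_mul_ps followed by _mm256_add_ps, the fused form uses a single instruction and a single rounding step.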

Fig. 5. AVX-512 subsets which are supported in Xeon Phi, Xeon Skylake and Skylake Core X.

4.1. The ISAs comparison

Table 5 depicts a comparison between Intel SIMD extensions from the ISA point of view. Each SIMD extension provides new features based on vector register width, operation type, and application requirements. Many instructions are carried over from the legacy of the prior SIMD extensions. VEX encoding is provided to implement 128-bit instructions for AVX/AVX2, and EVEX encoding is built on VEX encoding to provide more features for AVX-512. Intel AVX has introduced new instructions which are not present in the legacy of Intel SSEs, such as the vpermilps and vpermd instructions for permutation [20]. It employs the VEX encoding scheme, which supports 128-bit operations with three and four operands; such an instruction operates on the lower half of a ymm register and zeros the upper half. VEX encoding is designed to address compatibility issues and avoid the SSE/AVX transition penalty [26].

The AVX technology has been released using the new VEX prefix scheme, and most SIMD instructions were redesigned for VEX encoding. The VEX-encoded instructions add features to SIMD instructions, such as non-destructive operands. The VEX encoding scheme has also been used to build the new Enhanced VEX (EVEX) encoding scheme for the AVX-512 technology. Sometimes, VEX-encoded instructions are mixed with non-VEX instructions, which are the legacy SSE instructions. This mixture causes a significant performance penalty that should be avoided. It is quite easy to avoid the VEX/non-VEX transition penalty: the programmer must use the _mm256_zeroupper() intrinsic after each transition [37]. If the compiler limits the use of this intrinsic function, the instruction can be put into the generated code directly using in-line assembly such as __asm__ __volatile__ ("vzeroupper" : : :) [5]. In addition, there is no penalty for mixing any of the VEX 128/256 or EVEX 128/256/512 instructions on any current CPUs. Moreover, when the processor detects Intel AVX instructions, additional voltage is applied to the core. This makes the Central Processing Unit (CPU) hotter than usual, which means the frequency should be reduced to satisfy the thermal design power limits [67].
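The transition-penalty advice above can be illustrated with a small sketch; legacy_sse_filter is an assumed routine compiled as non-VEX (legacy SSE) code, not something from the paper.

```c
#include <immintrin.h>

void legacy_sse_filter(float *buf);   /* assumed legacy (non-VEX) SSE routine */

void mixed_code(float *a, float *b, float *out) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));

    /* Clear the upper 128 bits of the ymm registers before entering
       non-VEX (legacy SSE) code, avoiding the SSE/AVX transition penalty. */
    _mm256_zeroupper();

    legacy_sse_filter(out);
}
```

Recent compilers usually emit vzeroupper automatically at function boundaries, but the explicit intrinsic mirrors the manual approach described above.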


Table 4: AVX-512 subset extensions [37,39].

AVX-512F (Fundamental): This subset has many basic instructions which can be used for most HPC, enterprise, multimedia, scientific and engineering applications. It extends the AVX and AVX2 instructions and adds many new features; thus, it is the basic instruction set of any micro-processor supporting AVX3. It contains many instructions, such as vectorized arithmetic operations, comparisons, type conversions, data movements, data permutations, bitwise logical operations on vectors/mask registers, and miscellaneous math functions like min/max. It is extended using the EVEX prefix, which is built on the existing VEX prefix.

AVX-512CD (Conflict Detection): These instructions are used to test a vector element for equality to all other elements of the same vector. Currently, these instructions are implemented for 32-bit and 64-bit integers.

AVX-512ER (Exponential and Reciprocal): This subset contains instructions for base-2 exponential functions, reciprocals, and the inverse square root, suitable for scientific applications.

AVX-512PF (Prefetch): This subset is implemented for gather/scatter instructions to provide hints in the source code that improve the performance when a good memory access pattern is available.

AVX-512DQ (Doubleword and Quadword): This subset has been implemented to cover missing instructions such as and, or and xor for double-precision and single-precision floating-point elements. It also adds many new conversion, broadcast, classifier, extraction and insertion instructions.

AVX-512BW (Byte and Word): This subset supports many instructions for 8-bit and 16-bit integers and also provides many comparison instructions which set the mask registers.

AVX-512VL (Vector Length): It allows most AVX-512 instructions to operate on 128-bit and 256-bit vector lengths. This flag alone is never sufficient to determine that a given Intel AVX-512 instruction may be encoded at vector lengths smaller than 512 bits.

AVX-512VBMI (Vector Bit Manipulation): It adds a few additional vector byte permutation and multi-shift instructions.

AVX3.1 (F + CD + ER + PF): The term refers to the group of AVX-512 instructions (F + CD + ER + PF) which are currently implemented in Xeon Phi processors. It is mostly used for HPC workloads and supercomputers.

AVX3.2 (F + CD + BW + DQ + VL): The term refers to the group of AVX-512 instructions (F + CD + BW + DQ + VL) which are currently implemented in Xeon Skylake and Skylake-X Core i9 and Core i7 processors. It is used for servers and desktop computers.

Table 5: A comparison of Intel SIMD features of the SSEs, AVX, AVX2 and AVX3 extensions.

| Feature | SSEs | AVX | AVX2 | AVX3 |
|---|---|---|---|---|
| Vector width | 128-bit | 256-bit | 256-bit | 512-bit |
| Processor | Pentium III to Nehalem | Sandy Bridge | Haswell | Xeon Phi x200 |
| Lithography | 250 nm to 45 nm | 32 nm | 22 nm | 14 nm |
| # of instructions | 313 | >85 | >135 | >300 |
| # of operands | 3 | 4 | 4 | 4 |
| Prefix | – | VEX-encoded | VEX-encoded | EVEX-encoded |
| Mask reg. | – | – | – | 64-bit |
| String | 128-bit | – | – | – |
| Permute | – | 128/256-bit | 256-bit | 128/256/512-bit |
| Broadcast | – | 128/256-bit | 128/256-bit | 128/256/512-bit |
| Gather | – | – | 128/256-bit | 128/256/512-bit |
| Scatter | – | – | – | 128/256/512-bit |
| Compress | – | – | – | 128/256/512-bit |
| Expand | – | – | – | 128/256/512-bit |
| Conflict | – | – | – | 128/256/512-bit |
| Floating-point class | – | – | – | 128/256/512-bit |
| Floating-point scale | – | – | – | 128/256/512-bit |
| Get exponent | – | – | – | 128/256/512-bit |
| Get mantissa | – | – | – | 128/256/512-bit |
| Reduce | – | – | – | 128/256/512-bit |
| Bit rotation | – | – | – | 128/256/512-bit |
| Ternary logic | – | – | – | 128/256/512-bit |

AVX2 permutations are not supported for 128-bit operands, and a few instructions have been abandoned for 256-bit vector registers, such as the string instructions of SSE4.2, which are provided with VEX encoding only for 128-bit registers. Furthermore, many new instructions have been added to AVX3, such as scatter, to write vector register elements into non-contiguous memory addresses, and ternary logic, to implement any three-operand binary function. Permute instructions were added to AVX to shuffle elements across the lanes. Fig. 6 depicts the vpermpd instruction, which shuffles the double-precision floating-point elements a0–a3 according to the constant value in s8, which specifies the source location of each destination element. In other words, it uses (s1, s0), (s3, s2), (s5, s4) and (s7, s6) as selectors of the inputs and puts the selected elements in d0, d1, d2 and d3, respectively. For example, if (s1 == 1 and s0 == 0) it will put a2 in d0.
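A minimal sketch of the vpermpd operation of Fig. 6 using the corresponding AVX2 intrinsic follows; the immediate 0x1B simply reverses the four elements, and the function name is illustrative.

```c
#include <immintrin.h>

/* vpermpd: reorder the four doubles of a 256-bit register according to an
   8-bit immediate; each 2-bit field selects the source element for one
   destination element. 0x1B = 0b00011011 gives d0=a3, d1=a2, d2=a1, d3=a0. */
__m256d reverse_pd(__m256d a) {
    return _mm256_permute4x64_pd(a, 0x1B);
}
```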

Broadcast operations were originally provided as intrinsic functions such as _mm_set1_ps(), which duplicates a single-precision floating-point value to all elements of an SSE vector without any particular broadcast instruction in the ISA. The AVX ISA provides broadcast instructions such as vbroadcastf128 for copying a 128-bit value containing four floats or two doubles to a 256-bit vector register. AVX2 allows broadcasting a single element to all elements of a vector register: as depicted in Fig. 7, the vpbroadcastd instruction duplicates a 32-bit item from memory or a scalar register to all elements of a vector register. AVX2 also provides gather instructions to support fetching data from non-contiguous memory locations using offset parameters. As depicted in Fig. 8, the vpgatherdd instruction gathers 32-bit integers from memory using 32-bit indices. By specifying all offsets equal to the row size, reading the elements from a


column of the matrix with a single instruction is possible. In addition, the scale is equal to the size of the data type, the vindex register (ymm1) contains the indices used as offsets, and ymm2 can optionally be used to mask the gathered results.

Fig. 6. Operation of the vpermpd instruction.

Fig. 7. Broadcast instruction loads one 16-bit element from memory or a scalar register into all elements of a 256-bit vector register.

The AVX3 adds many new instructions such as scatter, compress, expand, conflict, floating-point classifier, floating-point scale, get exponent, get mantissa, reduce, bit rotation, bit-wise ternary logic and new conversions. It also provides mask operations using the eight opmask registers k0 to k7. Masking is provided for most AVX-512 instructions and allows masking off the excess vector elements when the loop count is not divisible by the vector size. There are two types of masking (merging and zeroing), differentiated using the z bit. Masking is cost effective and designed to avoid exceptions and penalties for sub-normal values and to save power [5,41]. Furthermore, VEX encoding has been enhanced to EVEX encoding, which supports vector length encoding up to 512 bits (zmm0–zmm31) and allows the use of the k1 to k7 opmask registers and the z bit for most instructions, such as vaddps zmm1 {k1}{z}, zmm2, zmm3 [39].
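The merging and zeroing masking just described can be sketched with AVX-512F intrinsics as follows; this is an illustrative fragment (it requires an AVX-512 capable CPU, unlike the AVX2 platform used for the experiments later in the paper), not code from the paper.

```c
#include <immintrin.h>

/* Merging masking: lanes with mask bit 0 keep the value of src.
   Zeroing masking: lanes with mask bit 0 become zero. */
__m512 masked_add_merge(__m512 src, __mmask16 k, __m512 a, __m512 b) {
    return _mm512_mask_add_ps(src, k, a, b);
}

__m512 masked_add_zero(__mmask16 k, __m512 a, __m512 b) {
    return _mm512_maskz_add_ps(k, a, b);
}

/* Typical use: handle a loop tail whose length is not a multiple of 16. */
void add_tail(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(c + i, _mm512_add_ps(_mm512_loadu_ps(a + i),
                                              _mm512_loadu_ps(b + i)));
    if (i < n) {
        __mmask16 k = (__mmask16)((1u << (n - i)) - 1);   /* low (n-i) lanes active */
        __m512 va = _mm512_maskz_loadu_ps(k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        _mm512_mask_storeu_ps(c + i, k, _mm512_add_ps(va, vb));
    }
}
```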

4.2. Special purpose instructions

In this paper, Special Purpose Instructions (SPIs) are those computational instructions which are not available in scalar mode as a single instruction. SPIs are implemented to perform a set of applicable operations using a single SIMD instruction. Since the primary goal of SIMD extensions is improving the performance of multimedia kernels, microprocessor vendors have provided SPIs particularly for multimedia kernels [87]. Intel has developed SPIs in each generation of the SSE extensions. It has provided special and elementary mathematical instructions such as the Sum of Absolute Differences (SAD) (psadbw), maximum (*max*), minimum (*min*), square root (sqrt*), reciprocal square root (rsqrt*), reciprocal (rcp*) and average (pavg*) in SSE/SSE2, absolute value (pabs*) in SSSE3, SADs of quadruplets (mpsadbw) and round (round*) in SSE4.1, and string comparison (pcmp*str*) in SSE4.2 [25]. AVX and AVX2 have not added new SPIs, while the previous SPIs have been extended; many new SPIs have been added to AVX3, such as get mantissa (vgetmant*), get exponent (vgetexp*), and reduce (vreduce*).

SSE2 provided an unsigned 8-bit SAD SPI that can accelerate the motion estimation kernel in the similarity measurement function of MPEG video codecs; it is available in the AVX2 and AVX3 instruction sets as well. As depicted in Fig. 9, AVX2 provides a vpsadbw ymm0, ymm1, ymm2 instruction, which computes the absolute differences of the packed unsigned 8-bit integers of ymm1 and ymm2, then horizontally adds each group of eight absolute differences to produce four unsigned 16-bit integers, and packs these unsigned 16-bit integers into the low 16 bits of the 64-bit elements of ymm0. For exploiting this instruction, the _mm256_sad_epu8 (__m256i, __m256i) intrinsic function is available, such that two 256-bit integer vectors participate as the input and the function returns four packed 64-bit values in a 256-bit vector. This operation is also available in AVX-512BW, which extends the operation to 512-bit vector registers.

5. Documents, tools and SIMD programming models

5.1. Remarkable documents and tools

Intel regularly publishes many manuals and white papers about SIMD architectures and programming. One of the most popular Intel manuals is the Intel 64 and IA-32 Architectures Software Developer Manuals [44]. The manual is usually split into four documents, which are available in the following volumes.

Fig. 8. Gather instruction loads different elements from non-contiguous memory addresses specified by offsets into the elements of a 256-bit vector register.


Fig. 9. Structure of the vpsadbw instruction which computes the SAD of unsigned 8-bit integers and packs the results into the low 16 bits of each 64-bit element of the vector register in AVX2 extension.
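To illustrate the vpsadbw/_mm256_sad_epu8 operation of Fig. 9, the following hedged sketch computes the SAD of two 32-byte blocks and reduces the four 64-bit partial sums to a single value; the names are illustrative, not the paper's code.

```c
#include <immintrin.h>
#include <stdint.h>

uint32_t sad32(const uint8_t *cur, const uint8_t *ref) {
    __m256i a = _mm256_loadu_si256((const __m256i *)cur);
    __m256i b = _mm256_loadu_si256((const __m256i *)ref);
    /* Four 64-bit partial SADs, one per 8-byte group. */
    __m256i sad = _mm256_sad_epu8(a, b);
    __m128i lo  = _mm256_castsi256_si128(sad);
    __m128i hi  = _mm256_extracti128_si256(sad, 1);
    __m128i sum = _mm_add_epi64(lo, hi);
    sum = _mm_add_epi64(sum, _mm_srli_si128(sum, 8));
    return (uint32_t)_mm_cvtsi128_si32(sum);
}
```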

Volume 1: Basic Architecture [34] describes the architecture and programming environment. Volume 2: Instruction Set Reference [45] describes the format of the instructions and provides reference pages for them, ordered from A to Z; it is split into the four sub-volumes 2A, 2B, 2C and 2D [46–49]. The 2A, 2B, and 2C volumes provide details for instructions from A to L, M to U and V to Z, respectively. The 2D volume describes the safer mode extensions and provides information about exceptions, encoding, machine code, etc. Volume 3: System Programming Guide [50] describes the operating-system support environment and is separated into four sub-volumes as well [51–54]. Volume 4: Model-Specific Registers [55] describes the model-specific registers in detail. These documents are for processors supporting the IA-32 and Intel 64 architectures. A combined version of all of these documents is provided in [28] and documentation changes are presented in [29]. Intel also publishes an optimization manual [30] that describes code optimization techniques to tune applications for highly optimized results. The AVX-512 is a big step in Intel SIMD technologies that introduces many new instructions for different purposes; in [16], using AVX-512 for math function implementations has been described.

Agner Fog provides some resources which are very useful for optimization and realization [1]. There are five optimization manuals which are updated as needed. In [5,6] code optimization has been described for the C++ and assembly languages, respectively. Details about micro-architectures are explained in [4], and specific information about instruction latencies, throughputs and micro-operation breakdowns is available in [3]. In [7] differences between various C++ compilers have been reported, which can be used to make more compatible software. All these manuals support x86 microprocessors from Intel, AMD, and VIA.

There are some tools to analyze and profile a program. First, performance counters for Linux, well known as the perf Linux command, can be used to obtain many statistics from a binary file. It is a kernel-based subsystem that provides a framework for collecting and analyzing performance data such as the number of cache misses, instructions, and CPU cycles in detail [75]. Second, for code analysis, Intel has been releasing the Intel Architecture Code Analyzer (IACA) [31,32] for Intel Core processors, which allows the programmer to put IACA_START before a section and IACA_END after it and see details about the bottleneck, latency, throughput, etc. It supports the Intel micro-architectures code-named Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Skylake on Windows, Linux and Mac OS X. Third, Valgrind as a dynamic binary instrumentation

framework is designed for building heavyweight dynamic binary analysis tools [69].

5.2. SIMD programming models

The basic concept of vectorization is to put data into vector elements and apply particular vector operations to compute the vector results. The number of vector elements shows how many operations can be performed simultaneously. Vector width, data type, number of vector registers and the complexity of the algorithm are the significant factors that determine the vector elements and impact SIMD programming. In order to exploit the vector capability, the algorithm should be mapped to vector instructions. There are various ways of mapping an algorithm to vector instructions, and finding the most efficient implementation from the performance point of view for a particular kernel is very important. In order to exploit the capability of SIMD extensions, different SIMD programming models, some of which are depicted in Table 6, have been proposed. In-line assembly is a low-level programming approach that exploits the assembly syntax of the ISA and can be used to add SIMD instructions in high-level languages such as C/C++. The intrinsics programming model provides access to the ISA's capabilities very similar to in-line assembly; however, IPM advises the compiler to map the functions to SIMD instructions explicitly. Many researchers have tried to develop an overall programming model for explicit SIMDization. For example, Vc is a free C++ library that provides portability between the various ISAs [61]. In [59] an interface based on Vc has been established which is named UME::SIMD. It allows the programmer to access the SIMD capabilities without the need for extensive knowledge of SIMD ISAs; UME::SIMD provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to IPM. In the following subsections, IPM and Compiler Automatic Vectorization (CAV) are explained as explicit and implicit SIMD programming, respectively.

5.2.1. Intrinsic programming model

Both hardware and software vendors have been developing their products to exploit SIMD capability. The Intrinsic Programming Model (IPM) is available in most modern compilers and can improve the performance of kernels and applications significantly.


Table 6: Different SIMD exploitation tools.

In-line assembly [72]: The programmer can read and write C variables from assembler and move from assembler to C code. Exploiting SIMD technologies this way needs a good knowledge of programming in both C and assembly.

IPM [37,42]: The Intrinsic Programming Model (IPM) provides an interface that allows programmers to write assembly-style SIMD code in a high-level language using intrinsic functions which are similar to C/C++ functions. IPM expands in-line, eliminating function-call overhead, provides the same benefit as using in-line assembly, improves code readability, assists instruction scheduling, and helps reduce debugging effort. It relies on compiler features.

gSIMD [95]: gSIMD is a portable model which overloads most C/C++ operators for a short-vector abstraction. It provides several mappings from platform-particular intrinsics to its simplified interface.

Vc [61]: The basic foundation of this free C++ library is to provide portability between the various ISAs as well as different compilers, while its API can be used easily for explicit vectorization.

VCL [2]: This library provides a number of C++ classes that allow exploiting SIMD instructions explicitly without the need for intrinsic functions.

Boost.SIMD [19]: Boost.SIMD is a high-level SIMD programming model. The primary goal of its API is to provide a portable C++ template to vectorize computation on AltiVec, SSE or AVX.

UME::SIMD [59]: It allows the programmer to access the SIMD capabilities without the need for extensive knowledge of SIMD ISAs. UME::SIMD provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to IPM.

Sierra [66]: Sierra is a SIMD extension for C++. This API provides a portable and easy way that allows the programmer to exploit vectorization explicitly.

ISPC [76]: The Intel SPMD Program Compiler (ISPC) is a way to exploit SPMD on SIMD architectures. This approach refers to both a language and a compiler: the language is a mix of C and C++ with additional keywords, and the compiler is built on top of the LLVM compiler infrastructure.

Cilk Plus [81]: Intel Cilk Plus extends C and C++ to exploit the vector parallelism commonly available in modern hardware and is mainly supported by ICC. The #pragma simd directive enables SIMDization directly.

OpenMP [71]: OpenMP provides a portable and scalable model for developers of shared-memory parallel applications. Compilers must support the OpenMP Application Program Interface (API). Exploiting SIMD is as simple as applying a #pragma omp simd to a loop to indicate that the loop can be transformed into a SIMD loop.

CAV [15,43,68]: Automatic generation of SIMD instructions is provided in compilers. Compiler Automatic Vectorization (CAV) is one of the easiest ways to exploit SIMD capabilities without extra programming effort. Most modern compilers such as the Intel C++ Compiler (ICC), GNU Compiler Collection (GCC), and Low-Level Virtual Machine (LLVM) have the CAV capability. For instance, GCC's auto-vectorizer is enabled at the -O3 optimization level or using -ftree-vectorize on the command line. However, limitations such as data misalignment, pointer aliasing, and data dependency may be obstacles to CAV. In addition, loop-level vectorization and super-word level vectorization are the two significant techniques of CAV.

The intrinsic programming model simplifies programming compared to in-line assembly [22] because compiler optimizations are still applied to the IPM code, and it spares the programmer's effort by offering a low-level programming model in a high-level programming language. IPM is preferred to in-line assembly because it allows the programmer to write a program for a specific architecture, such as x86, and compile it for different micro-architectures which support the used SIMD extensions. It mostly improves the performance much more than other SIMD programming models on a single core of the processor [77]. Programming with IPM needs extensive knowledge of the SIMD micro-architecture; moreover, portability, compatibility, and scalability must be planned. This is why many programming approaches, such as compiler automatic vectorization, have been developed to fill the gap in this area.

5.2.2. Compiler automatic vectorization

Compiler Automatic Vectorization (CAV) is one of the most important ways of implicit vectorization and increases programmability, compatibility, and portability. Programmers can use CAVs easily, since they rely on CAV features to avoid hand-tuned vectorization challenges. The performance improvement of CAVs is a significant concern of high performance computing. Compiler developers make significant efforts to embed new vectorization approaches in their compilers to lessen the programmer's influence on SIMDization. For this purpose, Loop-Level Vectorization (LLV) [70] and Super-word Level Parallelism (SLP) are mainly used to provide compiler automatic vectorization. SLP is a restricted form of instruction-level parallelism; SLP execution is preferred because it provides a less expensive and more energy-efficient solution. Basically, SLP identifies adjacent memory accesses to create new groups of statements, then merges all groups according to the super-word data-path width; finally, the packed statements are replaced with SIMD operations [64].
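As a small illustration of guiding the auto-vectorizer discussed above, the loop below is a typical loop-level vectorization target; the OpenMP simd pragma (see Table 6) asserts that it may be turned into a SIMD loop. The kernel and compile flags are illustrative, not the paper's benchmarks.

```c
/* Compile e.g. with gcc -O3 -fopenmp-simd (or icc -O3 -qopenmp-simd). */
void saxpy(int n, float a, const float * restrict x, float * restrict y) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```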

SLP was designed to work within a basic block for simple loops. The performance of these optimizations depends on the way the source code is written; in other words, by using code modification techniques it is possible to help the compilers auto-vectorize kernels more efficiently [11]. Although SIMD extensions are very simple from the hardware point of view, SIMD code generation has always been a challenging task. CAV is an easy-to-use approach, but it has a complicated development path. A perfect compiler cannot yet be found, and this is a unique opportunity for eager researchers to work on and find an efficient vectorization approach which is suitable for data-level parallelism. In this work, we implement some kernels using the IPM approach, which is simpler than the in-line assembly method, yields better performance than the other models, and is supported in most modern compilers [22,77,92].

6. Mapping kernels on SIMD extensions using IPM approach

In order to show the suitability of SIMD extensions for high performance computing, we have SIMDized the Finite Impulse Response (FIR) filter, matrix transposition (TRA), Matrix–Matrix Multiplication (MMM), and a[i] = a[i − 1] + c (AIC) kernels using SIMD programming approaches such as IPM and CAVs. We have used the AVX and AVX2 technologies as representative SIMD extensions in both IPM and CAVs. The straightforward implementations of these kernels are depicted in Table 7. The reasons for selecting these kernels are as follows. First, the FIR filter is a convolution-based kernel which reads data from unaligned memory addresses, which is important for vectorization. Second, matrix transposition reads data from non-contiguous memory addresses and writes them into contiguous memory addresses. Third, matrix–matrix multiplication is a computation-intensive kernel with several different algorithms


Table 7: Straightforward implementations of the kernels which have been used as benchmarks (a reconstructed code sketch is given after the table).

| Kernel | Scalar implementation |
|---|---|
| FIR | Nested loops over the N input samples and the M coefficients, accumulating products of inputs and coefficients |
| TRA | Doubly nested loop copying each element A[i][j] to AT[j][i] |
| A × B | Triple loop over rows of A, columns of B and the inner dimension |
| A × BT | Triple loop multiplying rows of A by rows of the transposed matrix BT |
| AIC (a[i] = a[i − 1] + c) | int A[N]; b = 1; c = 2; A[0] = b; for (i = 1; i < N; i++) A[i] = A[i-1] + c; |

Fig. 10. Intrinsic programming model implementation of the FIR kernel.
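The scalar loops referenced in Table 7 were truncated in the source; the following is a minimal C reconstruction based only on the kernel definitions in the text, so the bounds and array names (x, coef, y, A, B, BT, C) are assumptions.

```c
#define N 1024   /* input length / matrix dimension (assumed) */
#define M 16     /* number of FIR coefficients (assumed) */

void fir(const int *x, const int *coef, int *y) {        /* FIR; x assumed to hold N+M-1 samples */
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += x[i + j] * coef[j];
    }
}

void tra(int A[N][N], int AT[N][N]) {                    /* TRA */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            AT[j][i] = A[i][j];
}

void mmm(int A[N][N], int B[N][N], int C[N][N]) {        /* C = A x B */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

void mmm_bt(int A[N][N], int BT[N][N], int C[N][N]) {    /* C = A x B^T, BT already transposed */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * BT[j][k];
        }
}

void aic(int A[N]) {                                     /* AIC: a[i] = a[i-1] + c */
    int b = 1, c = 2;
    A[0] = b;
    for (int i = 1; i < N; i++)
        A[i] = A[i - 1] + c;
}
```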
that have different implementations. Finally, the a[i] = a[i − 1] + c recurrence needs the adjacent element for each computation, which might make it seem to be an unvectorizable kernel.

6.1. Finite impulse response filter

The finite impulse response filter is a common kernel in many multimedia applications such as pattern recognition and audio processing [86]. For SIMDization of the FIR filter, we vectorized the outer loop, which means that the inner loop computes one term of eight outputs simultaneously. As depicted in Fig. 10, we implemented this kernel using broadcast instructions. First, we implemented two traversal loops on the N and M constants, where N is the input array size and M is equal to the number of coefficients. Second, we loaded the input data and the coefficients using aligned load and broadcast instructions, respectively. Third, we read the operands using an unaligned load and a broadcast instruction, respectively. Then the sum of multiplications is computed. Finally, we stored the results to the output.
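Fig. 10 itself is not reproduced here; the following is a minimal sketch of the outer-loop vectorization just described, assuming 32-bit integer data, unaligned input loads and per-coefficient broadcasts. Names and types are assumptions, not the paper's code.

```c
#include <immintrin.h>

/* Eight 32-bit outputs are computed at once; x is assumed to hold N+M-1
   samples so that x[i+j] stays in bounds. */
void fir_avx2(const int *x, const int *coef, int *y, int N, int M) {
    for (int i = 0; i + 8 <= N; i += 8) {
        __m256i acc = _mm256_setzero_si256();
        for (int j = 0; j < M; j++) {
            __m256i vin = _mm256_loadu_si256((const __m256i *)(x + i + j));
            __m256i vc  = _mm256_set1_epi32(coef[j]);        /* broadcast */
            acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(vin, vc));
        }
        _mm256_storeu_si256((__m256i *)(y + i), acc);
    }
}
```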

Fig. 11. Data flow graph for vectorizing the outer loop of the FIR filter, which computes eight results simultaneously, where each element is 32 bits.

Fig. 11 depicts how the vectorized loop computes eight elements of the output simultaneously [9–11].

6.2. Matrix transposition

Transposing an M × M matrix using the gather instruction is depicted in Fig. 12. We declared an array of offset-making numbers which are loaded into the vindex vector; then we read an 8 × 8 block from the columns of the input matrix using gather instructions that use the 32-bit indices stored in vindex. Each 32-bit element is loaded from the base address at an offset equal to 4 × vindex bytes. The gathered elements are merged into the destination vector. In the next step, we store the vectors to the rows of the output by changing the matrix indices. Fig. 13 illustrates the operations to transpose the matrix A into the matrix AT, where data are read from columns and stored in the destination place. As depicted in Fig. 14, when each block is transposed we store the transposed block to the mirrored location of the output matrix. Briefly, the columns of each block are read using the gather instruction and stored in a vector register; then, the vector register is written into the output matrix while the indices are exchanged.
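A hedged sketch of the gather-based transposition described above is given below; M is the matrix dimension and the layout is row-major, both assumptions for illustration.

```c
#include <immintrin.h>

void transpose_gather(const int *A, int *AT, int M) {
    /* indices spaced one row apart, so a gather reads one column */
    const __m256i vindex = _mm256_setr_epi32(0, 1*M, 2*M, 3*M, 4*M, 5*M, 6*M, 7*M);
    for (int i = 0; i < M; i += 8)              /* block row of A */
        for (int j = 0; j < M; j++) {           /* column inside the block */
            /* gather A[i..i+7][j]; scale = 4 bytes per 32-bit element */
            __m256i col = _mm256_i32gather_epi32(A + i * M + j, vindex, 4);
            _mm256_storeu_si256((__m256i *)(AT + j * M + i), col);
        }
}
```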


Fig. 13. Data flow graph for vectorizing matrix transposition using the gather instruction, which reads values from non-contiguous memory addresses and stores them to contiguous memory addresses.

Fig. 12. IPM implementation of matrix transposition. Columns are read using gather instruction and stored into AT matrix.

6.3. Matrix–matrix multiplication

There are many approaches to the matrix–matrix multiplication kernel; we used the C = A × B and C = A × BT approaches. As depicted in Fig. 15, to implement C = A × B we read eight elements of the first matrix from its rows, followed by a gather instruction which loads eight elements from the columns of the second matrix; its data flow graph is depicted in Fig. 16. On the other hand, for the C = A × BT implementation, we transposed the second matrix to perform row-by-row multiplications; its IPM implementation is depicted in Fig. 17 and the data flow graph of its vectorization is shown in Fig. 18. We used an in-line _mm256_hadd2_epi32 function, which adds all elements of a vector horizontally, for both approaches, as depicted in Fig. 18.
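The following sketch illustrates the C = A × BT approach just described; the hsum_epi32 helper plays the role of the paper's in-line _mm256_hadd2_epi32 function but is an assumed equivalent, and N is taken as a multiple of 8.

```c
#include <immintrin.h>

/* Horizontal sum of the eight 32-bit lanes of a 256-bit register. */
static int hsum_epi32(__m256i v) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));
    s = _mm_hadd_epi32(s, s);
    s = _mm_hadd_epi32(s, s);
    return _mm_cvtsi128_si32(s);
}

void mmm_bt_avx2(const int *A, const int *BT, int *C, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            __m256i acc = _mm256_setzero_si256();
            for (int k = 0; k < N; k += 8) {
                __m256i va = _mm256_loadu_si256((const __m256i *)(A  + i * N + k));
                __m256i vb = _mm256_loadu_si256((const __m256i *)(BT + j * N + k));
                acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(va, vb));
            }
            C[i * N + j] = hsum_epi32(acc);
        }
}
```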

Fig. 14. Data flow graph for matrix transposition using the blocking approach. In order to transpose a matrix of 32-bit elements, the matrix is divided into 8 × 8 blocks.

6.4. Vectorization of a[i] = a[i − 1] + c

Sometimes a loop contains a true dependence, such as a[i] = a[i − 1] + c, which seems to be unvectorizable. At first glance, because the current value a[i] needs the prior value a[i − 1], it might be inferred that this operation is not vectorizable. In order to vectorize this operation, the dependencies should be addressed. Fig. 19 depicts the data flow graph for vectorizing this kernel. In order to show how to vectorize this kernel, some traces of Fig. 19 are depicted in Fig. 20. As can be seen, by repeatedly substituting the preceding element of the sequence, a new equation, a[i] = a[0] + i × c, is obtained at the top of the figure. Because the vectorization factor is eight, 8 × c needs to be added in each step, which results in a new equation, a[i] = a[i − 8] + 8 × c, obtained at the bottom of the figure. Fig. 21 depicts the IPM code of the a[i] = a[i − 1] + c kernel. As can be seen, the dependencies are resolved in the first step; in the second step, the computations are performed. This mapping strategy is an example of an SIMD vectorization method for loops which contain data dependencies.
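A hedged sketch of this dependency-removal strategy with AVX2 intrinsics follows (Fig. 21 contains the paper's actual code; the version below only illustrates the a[i] = a[i − 8] + 8 × c recurrence with assumed names).

```c
#include <immintrin.h>

void aic_avx2(int *A, int N, int b, int c) {
    A[0] = b;
    /* ramp = {1c, 2c, ..., 8c}: offsets of the eight results from a[i-1] */
    __m256i ramp = _mm256_mullo_epi32(_mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8),
                                      _mm256_set1_epi32(c));
    __m256i prev = _mm256_set1_epi32(A[0]);            /* broadcast of a[i-1] */
    __m256i step = _mm256_set1_epi32(8 * c);
    for (int i = 1; i + 7 < N; i += 8) {
        __m256i cur = _mm256_add_epi32(prev, ramp);    /* a[i..i+7] */
        _mm256_storeu_si256((__m256i *)(A + i), cur);
        prev = _mm256_add_epi32(prev, step);           /* becomes broadcast of a[i+7] */
    }
}
```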

Fig. 15. IPM implementation for vectorization of C = A × B using gather instruction.


Fig. 16. Data flow graph for vectorization of C = A × B using the gather instruction, whose IPM implementation is depicted in Fig. 15.

Fig. 19. Data flow graph for vectorization of a[i] = a[i − 1] + c that shows how data dependencies are removed.

Fig. 17. IPM implementation for vectorization of C = A × BT using matrix transposition.

Fig. 20. Some traces of Fig. 19 to show how the a[i] = a[i − 1] + c kernel is vectorized.

Fig. 18. Data flow graph for vectorization of C = A × BT, whose IPM implementation is depicted in Fig. 17.

7. Performance evaluation

In this section we evaluate the performance of both SIMD programming approaches, the intrinsic programming model and compiler automatic vectorization, using the implemented multimedia kernels.

7.1. Environment setup

As depicted in Table 8, our platform is based on a 2.60 GHz Intel Core i7-6700HQ processor with three levels of data cache.

Table 8: Platform specification.

| CPU | Intel Core i7-6700HQ |
|---|---|
| Register width | Maximum 256 bits |
| Cache line size | 64 bytes |
| L1 data cache | 32 kB, 8-way set associative, fastest latency: 4 cycles, 2 × 32 B load + 1 × 32 B store |
| L2 cache | 256 kB, 4-way set associative, fastest latency: 12 cycles |
| L3 cache | Up to 2 MB per core, up to 16 ways, fastest latency: 44 cycles |
| Operating system | Fedora 30, 64-bit |
| Compilers | ICC 19.0.3, GCC 9.1.1 and LLVM 8.0.0 |
| Programming tools | C and x86 intrinsics (x86intrin.h) |
| Disable vectorizing | icc -O3 -no-vec; gcc -O3 -fno-tree-vectorize -fno-tree-slp-vectorize; clang -O3 -fno-vectorize -fno-slp-vectorize |

Each core of the processor, a Skylake GPP, supports SIMD extensions up to AVX2 and has vector ALUs to issue SIMD instructions. We used ICC, GCC and LLVM to compile the programs at the -O3 optimization level for integers, and the -Ofast optimization flag was enabled for floating-point operations to utilize the CAVs' abilities. In order to evaluate SIMDization on a single core, all implementations were performed on one reserved core


Table 9
SIMD instructions used to vectorize the kernels (explicit and implicit vectorization); the latency, throughput and operand place (register or memory) of each instruction are given in [3].

vextractf128: Extract packed floating-point values
vextracti128: Extract packed integer values
vinsertf128: Insert packed floating-point values
vmovaps: Move aligned single-precision values
vmovdq(a/u): Move aligned/unaligned integer values
vmovhpd: Move high packed double-precision value
vmovups: Move unaligned single-precision values
vpadd(d/q): Vector pairwise addition (doubleword/quadword)
vpbroadcastd: Load integer and broadcast
vpcmpeqd: Compare packed doublewords for equality
vpextrd: Extract a doubleword integer value
vpgatherdd: Packed gather of doublewords using doubleword indices
vpgatherqd: Packed gather of doublewords using quadword indices
vpmulld: Packed multiply low doubleword
vpshufd/vshufps: Shuffle packed doublewords / single-precision values
vpsllq/vpsrlq: Shift packed quadwords left/right logical
vpsrldq: Shift double quadword right logical
vpunpckldq: Unpack low doublewords

Fig. 21. IPM implementation for vectorization of the a[i] = a[i − 1] + c kernel, which has a data dependence.

For reserving the core, the isolcpus Linux kernel parameter was added to the boot options. Both explicit and implicit vectorization have been implemented and compared for the different compilers. The explicit vectorization includes IPM-AVX2 and IPM-SSE4, while the implicit vectorization includes CAV-AVX2. The scalar ICC implementation of each kernel has been used as the baseline, and all comparisons are made against this implementation. In order to evaluate the impact of the input size on performance, different matrix sizes from 256 × 256 to 2048 × 2048 have been used. In order to warm up the cache and avoid context-switching effects, we executed each program many times and measured the smallest execution time. When a program is executed many times inside a do-while loop, the compiler might optimize it differently than without the loop; appropriate measures were taken to avoid such mismatches. Moreover, we disabled the auto-vectorization capability for the IPM versions and the baseline scalar implementations. Table 9 lists some of the SIMD instructions that are used to vectorize the kernels with explicit and implicit vectorization, together with a short description of each; the operand place (register or memory), latency and throughput of these instructions are given in [3]. For example, the vextractf128 instruction extracts 128 bits of packed floating-point elements from a vector register and stores them in another vector register with a latency of three cycles and a throughput of one cycle [3].
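The following is a hypothetical sketch of such a timing harness, not the code used in the paper: the kernel is executed repeatedly and only the smallest wall-clock time is kept, so that the reported figure corresponds to warm caches; REPEAT and kernel_under_test are placeholder names.

#include <stdio.h>
#include <time.h>

#define REPEAT 100

static volatile int sink;                 /* keeps the dummy kernel alive */
static void kernel_under_test(void)
{
    int s = 0;
    for (int i = 0; i < 1000000; i++) s += i;
    sink = s;
}

int main(void)
{
    double best = 1e30;
    for (int r = 0; r < REPEAT; r++) {
        struct timespec s, e;
        clock_gettime(CLOCK_MONOTONIC, &s);
        kernel_under_test();
        clock_gettime(CLOCK_MONOTONIC, &e);
        double t = (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) * 1e-9;
        if (t < best) best = t;           /* keep the minimum over all runs */
    }
    printf("best time: %.6f s\n", best);
    return 0;
}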

7.2. Experimental results of the finite impulse response filter

Fig. 22 depicts the speedups of the different compilers for explicit and implicit vectorization, and of the scalar implementations of GCC and

LLVM over the scalar ICC implementation of the finite impulse response filter for different matrix sizes from 256 × 256 to 2048 × 2048. As can be seen in this figure, the speedups of IPM-AVX2 and IPM-SSE4 for all compilers range from 3.28 to 6.64 and from 2.84 to 3.56, respectively. In other words, the IPM speedups are almost the same for the different compilers. The speedups of IPM-AVX2 are almost twice those of IPM-SSE4 for small matrix sizes, and the gap narrows for larger matrix sizes. The speedups of CAV-AVX2, in contrast, differ per compiler: they range from 1.36 to 1.38, from 2.91 to 3.6 and from 1.0 to 1.02 for ICC, GCC, and LLVM, respectively. Both ICC and GCC vectorize the kernel, while LLVM does not, so its performance is not improved. The speedups of CAV-GCC are larger than those of CAV-ICC. In addition, for both IPM and CAV, the speedups of ICC and GCC for larger matrix sizes are lower than for small matrix sizes because of memory bottlenecks. In order to show the reasons for these differences between the IPM and CAV speedups, parts of the generated SIMD code of the innermost loop for IPM-AVX2/IPM-SSE4 and for the CAVs of ICC and GCC are depicted in Figs. 23–25, respectively. As Fig. 23 shows, the FIR kernel is efficiently mapped to 128- and 256-bit registers and SIMD instructions in SSE4 and AVX2, respectively; in IPM-AVX2 the vpbroadcastd instruction is used, while SSE4 has no such instruction. From a comparison between Figs. 24 and 25, it can be seen that the auto-vectorizer of ICC vectorizes the kernel using both 128- and 256-bit vector registers, while the auto-vectorizer of GCC uses only 256-bit vector registers. In addition, the vectorized code of ICC has more overhead than that of GCC because it uses vextractf128 and vpsrldq rearrangement instructions to reorder the multiplication results, while no such instructions appear in the GCC code. Furthermore, GCC broadcasts the coefficients to vector registers outside the inner loop and performs the computations inside the loop. These are the reasons why the speedups of the auto-vectorizer of GCC are larger than those of ICC.
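As a rough illustration of the broadcast strategy described above (a sketch only, not a reproduction of the code in Figs. 23–25), the example below hoists the coefficient broadcasts out of the inner loop for a small one-dimensional integer FIR filter with AVX2; NTAPS and all other names are assumptions made for this example.

#include <immintrin.h>

#define NTAPS 4

/* Hypothetical sketch: coefficients are broadcast once, outside the inner
   loop (vpbroadcastd), and the inner loop only performs loads, multiplies
   and adds on 8 outputs at a time. Assumes n >= NTAPS. */
void fir_avx2(const int *x, const int *coef, int *y, int n)
{
    __m256i c[NTAPS];
    for (int t = 0; t < NTAPS; t++)
        c[t] = _mm256_set1_epi32(coef[t]);        /* hoisted broadcasts */

    for (int i = 0; i + 8 <= n - NTAPS + 1; i += 8) {
        __m256i acc = _mm256_setzero_si256();
        for (int t = 0; t < NTAPS; t++) {
            __m256i v = _mm256_loadu_si256((const __m256i *)&x[i + t]);
            acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(v, c[t]));
        }
        _mm256_storeu_si256((__m256i *)&y[i], acc);   /* y[i..i+7] */
    }
}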


Fig. 22. Speedups of ICC, GCC, and LLVM for IPM-AVX2, IPM-SSE4 and CAV-AVX2 and scalar implementations of GCC and LLVM over scalar implementation of ICC version of the finite impulse response filter for different matrix sizes.

Fig. 25. A part of the SIMD code generated using auto-vectorization of GCC for FIR filter.

Fig. 23. A part of the SIMD code generated by the IPM implementation of the FIR kernel using the AVX2 and SSE4 technologies.

7.3. Experimental results of matrix transposition

Figs. 26 and 27 depict the speedups of the different compilers for explicit and implicit vectorization, and of the scalar implementations of GCC and LLVM, over the scalar ICC implementation of the matrix transposition kernel for integer and floating-point numbers, respectively, for matrix sizes from 256 × 256 to 2048 × 2048. In explicit vectorization there are three versions: IPM-AVX2-Shuffle, IPM-AVX2-Gather, and IPM-SSE4. In Fig. 26 the speedups of IPM-AVX2-Shuffle and IPM-AVX2-Gather are almost the same, ranging from 2.84 to 5.55, from 2.91 to 5.56 and from 2.86 to 5.58 for ICC, GCC and LLVM, respectively. The speedups of IPM-SSE4 range from 1.52 to 3.19, from 1.55 to 3.83, and from 1.51 to 3.25 for ICC, GCC and LLVM, respectively. Both figures show almost the same behavior: the speedups increase as the matrix size increases. The compilers' automatic vectorization, on the other hand, behaves completely differently from one compiler to another. ICC auto-vectorization generates unpack instructions on 128-bit vector registers in an inefficient manner; GCC auto-vectorization generates extract instructions on a combination of 128- and 256-bit vector registers; LLVM does not vectorize the code at all. The reason why the LLVM speedups are less than one for some matrix sizes is that the scalar reference implementation of ICC is much faster than the LLVM scalar implementation.
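To illustrate the idea behind the IPM-AVX2-Gather version (a sketch only, not the paper's implementation), one AVX2 gather can collect eight elements of a column of B, which then form eight consecutive elements of a row of the transposed matrix; n is assumed to be a multiple of eight and all names are illustrative.

#include <immintrin.h>

/* Hypothetical sketch of a gather-based transpose for 32-bit integers:
   one gather with stride n reads a piece of column j of B and a single
   contiguous store writes it as part of row j of BT. */
void transpose_gather(const int *b, int *bt, int n)
{
    /* element offsets 0, n, 2n, ..., 7n select 8 consecutive rows of column j */
    __m256i idx = _mm256_setr_epi32(0, n, 2*n, 3*n, 4*n, 5*n, 6*n, 7*n);

    for (int j = 0; j < n; j++)               /* column of B = row of BT */
        for (int i = 0; i < n; i += 8) {
            __m256i col = _mm256_i32gather_epi32(&b[i*n + j], idx, 4);
            _mm256_storeu_si256((__m256i *)&bt[j*n + i], col);
        }
}

Each iteration issues one gather per contiguous 256-bit store, which trades the rearrangement instructions of the shuffle-based version for gather latency.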

Fig. 24. A part of the SIMD code generated using auto-vectorization of ICC for FIR filter.


7.4. Experimental results of matrix–matrix multiplication

Figs. 28 and 29 depict the speedups of the different compilers for the matrix–matrix multiplication kernel over the scalar ICC implementation, using the straightforward implementation (MMM-AB), for integer and floating-point numbers, respectively. In these figures the IPM versions have been implemented using gather instructions. Almost all speedups are less than one, except for auto-vectorization in GCC, which ranges from 3.64 to 5.3 for both integer and floating-point numbers.


Fig. 26. Speedups of ICC, GCC, and LLVM for IPM-AVX2, IPM-SSE4 and CAV-AVX2 and scalar implementations of GCC and LLVM over scalar implementation of ICC version of matrix transposition kernel for integer numbers for different matrix sizes.

Fig. 30. An example to show how GCC auto-vectorization vectorizes the matrix–matrix multiplication kernel using broadcast instruction.

Fig. 27. Speedups of ICC, GCC, and LLVM for IPM-AVX2, IPM-SSE4 and CAV-AVX2 and scalar implementations of GCC and LLVM over scalar implementation of ICC version of matrix transposition kernel for floating-point numbers for different matrix sizes.

Fig. 28. Speedups of ICC, GCC, and LLVM for IPM-AVX2 and CAV-AVX2 over scalar implementation of ICC version of matrix–matrix multiplication kernel for integer numbers using straightforward implementation (MMM-AB) for different matrix sizes.

Fig. 29. Speedups of ICC, GCC, and LLVM for IPM-AVX and CAV-AVX over scalar implementation of ICC version of matrix–matrix multiplication kernel for floating-point numbers using straightforward implementation (MMM-AB) for different matrix sizes.

The main reason for this is that GCC maps the algorithm completely differently from the other compilers. In addition, GCC uses a broadcast instruction, while ICC and LLVM do not. To clarify this in more detail, Fig. 30 shows how GCC auto-vectorization is applied to a matrix–matrix multiplication of two 8 × 8 matrices using the broadcast instruction; for simplicity, 4-way data-level parallelism is used in the figure. Each element of a row of the first matrix is repeated four times in a vector register using the broadcast instruction. These vector registers are multiplied with the elements of the rows of the second matrix, and the products are kept as intermediate results. All these intermediate results are then summed with each other, so that four output elements are computed simultaneously. As can be seen, this technique accesses the elements in a row-wise manner while producing four output elements at a time. These are the reasons why GCC auto-vectorization has an advantage over the gather and rearrangement instructions used by ICC and LLVM.
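A minimal sketch of this broadcast scheme for one row of the result, using AVX2 and 8-way parallelism rather than the 4-way example of Fig. 30, is shown below; the names and structure are illustrative and do not reproduce GCC's generated code.

#include <immintrin.h>

/* Hypothetical sketch: every element of row i of A is broadcast
   (vpbroadcastd), multiplied with a row of B (vpmulld) and accumulated
   (vpaddd), so 8 elements of row i of C are produced per j-block.
   Assumes n is a multiple of 8. */
void mmm_broadcast_row(const int *A, const int *B, int *C, int n, int i)
{
    for (int j = 0; j < n; j += 8) {
        __m256i acc = _mm256_setzero_si256();
        for (int k = 0; k < n; k++) {
            __m256i a = _mm256_set1_epi32(A[i*n + k]);             /* broadcast A[i][k] */
            __m256i b = _mm256_loadu_si256((const __m256i *)&B[k*n + j]);
            acc = _mm256_add_epi32(acc, _mm256_mullo_epi32(a, b)); /* accumulate        */
        }
        _mm256_storeu_si256((__m256i *)&C[i*n + j], acc);          /* C[i][j..j+7]      */
    }
}

Calling this routine for every row i produces the full product; all loads of B are from consecutive memory, which is the key difference from the gather-based mapping.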

Fig. 31. Speedups of ICC, GCC, and LLVM for IPM-AVX2 and CAV-AVX2 over scalar implementation of ICC version of matrix–matrix multiplication kernel for integer numbers using matrix transposition (MMM-ABT) for different matrix sizes.

Fig. 32. Speedups of ICC, GCC, and LLVM for IPM-AVX and CAV-AVX over scalar implementation of ICC version of matrix–matrix multiplication kernel for floating-point numbers using matrix transposition (MMM-ABT) for different matrix sizes.

The IPM implementations of the MMM-AB-int and MMM-AB-float kernels are not efficient because they load data from non-consecutive memory locations using gather instructions. Their performance has been improved by changing the mapping strategy from the straightforward implementation to the transpose-based implementation in the MMM-ABT-int and MMM-ABT-float kernels. Figs. 31 and 32 depict the speedups of the different compilers for the matrix–matrix multiplication kernel over the scalar ICC implementation for integer and floating-point numbers using the transposition implementation (MMM-ABT), respectively. As these figures show, the speedups of IPM-AVX2 (integer) and IPM-AVX (floating-point) range from 1.51 to 4.39 and from 1.39 to 3.67, respectively, and those of CAV-AVX2 (integer) and CAV-AVX (floating-point) range from 1.48 to 3.9 and from 1.42 to 4.48, respectively. All these speedups show that the compilers behave almost the same for both the IPM and CAV implementations.
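For the transpose-based mapping, one possible kernel (a sketch under the assumption that n is a multiple of eight, not the paper's exact code) computes each output element as a dot product of a row of A and a row of B^T, so all vector loads are from consecutive memory:

#include <immintrin.h>

/* Hypothetical sketch of the MMM-ABT idea for floats (AVX): plain vector
   loads replace the gathers of MMM-AB, and a horizontal sum reduces the
   8 partial sums to one output element. */
float dot_row(const float *a_row, const float *bt_row, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int k = 0; k < n; k += 8)
        acc = _mm256_add_ps(acc,
                _mm256_mul_ps(_mm256_loadu_ps(&a_row[k]),
                              _mm256_loadu_ps(&bt_row[k])));

    /* horizontal sum of the 8 partial sums */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}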


Fig. 33. Speedups of ICC, GCC, and LLVM for IPM-AVX2, CAV-AVX2 and scalar implementations of GCC and LLVM over scalar implementation of ICC version of a[i] = a[i − 1] + c kernel for different matrix sizes.

7.5. Experimental results of a[i] = a[i−1] + c

Fig. 33 depicts the speedups of the different compilers for explicit and implicit vectorization, and of the scalar implementations of GCC and LLVM, over the scalar ICC implementation of the a[i] = a[i−1] + c kernel for different matrix sizes from 256 × 256 to 2048 × 2048. The performance improvement of the AIC-int kernel on ICC is lower than on the other compilers in most cases. Both ICC and GCC do not vectorize the recurrence. LLVM, on the other hand, not only vectorizes it but also improves the performance significantly, especially for small matrix sizes. The generated assembly code shows that LLVM optimizes the code more efficiently than the other compilers: it recognizes that the results of the current iteration will be used in the next iteration, so the vector register holding the result elements is written to memory and also kept in another register for the next computation, whereas ICC writes the results to memory and reads the same address again in the next iteration. This means that the way an algorithm is mapped onto SIMD architectures affects performance, and finding a suitable mapping strategy and data structure is a challenging problem for programmers.

8. Conclusions

Single Instruction Multiple Data (SIMD) technology, which has been continuously extended by processor vendors, provides vector computations to increase the performance of applications. Intel has introduced different SIMD extensions from MMX to AVX-512, with vector registers from 64 to 512 bits. Each extension provides different SIMD arithmetic, logical, rearrangement and special-purpose instructions. The SIMD concept and these extensions have been briefly reviewed in this paper. In order to exploit these SIMD technologies, many SIMDization approaches such as implicit and explicit vectorization have been developed. The easiest is Compiler Automatic Vectorization (CAV), as implicit vectorization, in which the compiler is responsible for mapping the algorithm and generating efficient SIMD code; this remains a challenging task for compilers such as ICC, GCC, and LLVM. Each compiler uses different vectorization techniques, and their generated SIMD codes differ. In addition, the way a kernel is implemented, in terms of algorithm and data structure, affects the vectorized code. Our experimental results showed that the auto-vectorizers of GCC and ICC perform better than that of LLVM. In general, CAVs cannot generate efficient vectorized code in some cases, and they need further development to vectorize different algorithms efficiently. The Intrinsic Programming Model (IPM), as explicit vectorization, is another way to exploit SIMD capabilities, in which programmers are responsible for mapping algorithms onto SIMD extensions. The programmer's knowledge of SIMD architectures, algorithms and mapping strategies has a direct impact on the performance improvement. Our experimental results show that the different compilers behave almost the same for the IPM implementations, while programming with IPM is tedious and error-prone and needs more programming effort than CAVs.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Agner, Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X. URL http://www.agner.org/optimize/. [2] Agner Fog, VCL C++ vector class library, Tech. rep. URL http://www.agner. org/optimize/vectorclass.pdf, 2017. [3] Agner Fog, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, Tech. rep. Technical University of Denmark. URL http://www.agner.org/optimize/ instruction_tables.pdf, 2017. [4] Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs An optimization guide for assembly programmers and compiler makers, Tech. rep. URL http://www.agner.org/optimize/microarchitecture.pdf, 2017. [5] Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, Tech. rep., URL http://www.agner.org/optimize/ optimizing_assembly.pdf, 2017. [6] Agner Fog, Optimizing software in C ++: An optimization guide for Windows, Linux and Mac platforms, Tech. rep. Technical University of Denmark. URL http://www.agner.org/optimize/optimizing_cpp.pdf, 2017. [7] Agner Fog, Calling conventions: for different C++ compilers and operating systems, Tech. rep. URL http://www.agner.org/optimize/calling_ conventions.pdf, 2017. [8] AMD, 3DNow! Technology Manual, Tech. rep. (2000). [9] H. Amiri, A. Shahbahrami, High performance implementation of 2-D convolution usingss AVX2, in: 19th IEEE International Symposium on Computer Architecture and Digital Systems, 2017, pp. 1–4. [10] H. Amiri, A. Shahbahrami, High performance implementation of 2D convolution using intel’s advanced vector extensions, in: IEEE International Conference on Artificial Intelligence and Signal Processing, 2017, pp. 25–30. http://dx.doi.org/10.1109/AISP.2017.8324097. [11] H. Amiri, A. Shahbahrami, A. Pohl, B.H. Juurlink, Performance evaluation of implicit and explicit SIMDization, Microprocess. Microsyst. 63 (2018) 158–168. [12] M.A. Arslan, F. Gruian, K. Kuchcinski, A. Karlsson, Code generation for a SIMD architecture with custom memory organisation, in: 2016 Conference on Design and Architectures for Signal and Image Processing (DASIP), IEEE, 2016, pp. 90–97, http://dx.doi.org/10.1109/DASIP.2016.7853802. [13] P. Bannon, Y. Saito, The alpha 21164pc microprocessor, in: Proceedings IEEE COMPCON 97. Digest of Papers, IEEE Comput. Soc. Press, 1997, pp. 20–27, http://dx.doi.org/10.1109/CMPCON.1997.584665. [14] D. Cheresiz, B. Juurlink, S. Vassiliadis, Performance Benefits of SpecialPurpose Instructions in the CSI Architecture. [15] Clang, clang: a C language family frontend for LLVM. URL https://clang. llvm.org/. R [16] M. Cornea, Intel⃝ AVX-512 Instructions and Their Use in the Implementation of Math Functions, Tech. rep. 2015. [17] K. Diefendorff, P. Dubey, R. Hochsprung, H. Scale, AltiVec extension to PowerPC accelerates media processing, IEEE Micro 20 (2) (2000) 85–95, http://dx.doi.org/10.1109/40.848475. [18] A. Estebanez, D.R. Llanos, A. Gonzalez-Escribano, A survey on threadlevel speculation techniques, ACM Comput. Surv. 49 (2) (2016) 1–39, http://dx.doi.org/10.1145/2938369. [19] P. Esterie, J. Falcou, M. Gaunard, J.-T. Laprest, Boost.SIMD: Generic programming for portable SIMDization, in: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, 2014, pp. 1–7, http: //dx.doi.org/10.1145/2568058.2568063. [20] N. Firasta, M. Buxton, P. Jinbo, K. Nasri, S. Kuo, Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency, White Paper (2008) 1–9. [21] P. Gepner, V. Gamayunov, D.L. 
Fraser, Early performance evaluation of AVX for HPC, Procedia Comput. Sci. 4 (2011) 452–460, http://dx.doi.org/ 10.1016/j.procs.2011.04.047. [22] S.A. Hassan, A. Hemeida, M.M. Mahmoud, Performance evaluation of matrix-matrix multiplications using intel’s advanced vector extensions (AVX), Microprocess. Microsyst. 47 (2016) 369–374, http://dx.doi.org/10. 1016/j.micpro.2016.10.002. [23] J.L. Hennessy, D.A. Patterson, In Praise of Computer Architecture : A Quantitative Approach, 2011. [24] IBM, Synergistic Processor Unit Instruction Set Architecture Synergistic Processor Unit. [25] Intel Corporation, Intel SSE4 Programming Reference (D91561-001). [26] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1 3 (253668-039US).

H. Amiri and A. Shahbahrami / Journal of Parallel and Distributed Computing 135 (2020) 83–100 [27] Intel Corporation, Intel Advanced Vector Extensions Programming Reference (319433-011). R [28] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (325462-063US). URL https://software.intel.com/sites/default/files/ managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf. R [29] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Documentation Changes (252046-055). 0000. URL https://software.intel.com/sites/default/files/managed/3e/79/252046-sdmchange-document.pdf. R [30] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Optimization Reference Manual (248966-037). URL https://software.intel.com/sites/default/ files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf. R [31] Intel Corporation, Intel⃝ Architecture Code Analyzer 2.3 (321356001US). URL https://software.intel.com/sites/default/files/managed/29/78/ intel-architecture-code-analyzer-2.3-users-guide.pdf. [32] Intel Corporation, Intel Architecture Code Analyzer for Intel AVX Instruction Set (321356-001US). doi:321356-001US. [33] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual (252046-049). R [34] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture 1 (253665-063US). URL https://software.intel.com/sites/default/files/managed/a4/60/253665sdm-vol-1.pdf. [35] Intel Corporation, Intel Architecture Instruction Set Extensions Programming Reference (319433-026). [36] Intel Corporation, Intel Architecture Instruction Set Extensions Programming Reference (319433-023). [37] Intel Corporation, Intel Intrinsics Guide. URL https://software.intel.com/ sites/l{and}ingpage/IntrinsicsGuide/. [38] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture 1 (253665-037US). [39] Intel Corporation, Intel Architecture Instruction Set Extensions Programming Reference (319433-029). R [40] Intel Corporation, Intel⃝ Product Specifications. URL https://ark.intel.com/. R [41] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture 1 (253665-060US). [42] Intel Corporation, Intel C ++ Intrinsic Reference (312482-003US). [43] Intel Corporation, Intel C++ Compiler User and Reference Guides (304968-022US). DOI: 304968-022US. R [44] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer R Manuals | Intel⃝ Software. URL https://software.intel.com/en-us/articles/ intel-sdm. R [45] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference, A-Z 2 (325383-063US). URL https://software.intel.com/sites/default/files/ managed/a4/60/325383-sdm-vol-2abcd.pdf. R [46] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-L 2 (253666063US). URL https://software.intel.com/sites/default/files/managed/ad/01/ 253666-sdm-vol-2a.pdf. R [47] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 2B: Instruction Set Reference, M-U 2 (253667063US). URL https://software.intel.com/sites/default/files/managed/7c/f1/ 253667-sdm-vol-2b.pdf. R [48] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 2C: Instruction Set Reference, V-Z 2 (326018063US). 
URL https://software.intel.com/sites/default/files/managed/7c/f1/ 326018-sdm-vol-2c.pdf. R [49] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 2D: Instruction Set Reference 2 (334569063US). URL https://software.intel.com/sites/default/files/managed/7c/f1/ 334569-sdm-vol-2d.pdf. R [50] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide 3 (325384-063US). URL https://software.intel.com/sites/default/files/ managed/a4/60/325384-sdm-vol-3abcd.pdf. R [51] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1 3 (253668063US). URL https://software.intel.com/sites/default/files/managed/7c/f1/ 253668-sdm-vol-3a.pdf. R [52] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2 3 (253669063US). URL https://software.intel.com/sites/default/files/managed/7c/f1/ 253669-sdm-vol-3b.pdf. R [53] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 3C: System Programming Guide, Part 3 3 (326019063US). URL https://software.intel.com/sites/default/files/managed/7c/f1/ 326019-sdm-vol-3c.pdf.


R [54] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 3D: System Programming Guide, Part 4 3 (332831063US). URL https://software.intel.com/sites/default/files/managed/7c/f1/ 332831-sdm-vol-3d.pdf. R [55] Intel Corporation, Intel⃝ 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers 4 (335592-063US). URL https://software.intel.com/sites/default/files/managed/22/0d/335592sdm-vol-4.pdf. [56] T. Jain, T. Agrawal, The haswell microarchitecture - 4th generation processor, Int. J. Comput. Sci. Inf. Technol. 4 (3) (2013) 477–480, http://dx.doi. org/10.1109/MM.2014.10. [57] M. Jennings, T. Coate, Subword extensions for video processing on mobile systems, IEEE Concurr. 6 (3) (1998) 13–16, http://dx.doi.org/10.1109/4434. 708250. [58] B. Juurlink, A. Shahbahrami, S. Vassiliadis, Avoiding data conversions in embedded media processors, in: 2005 ACM Symposium on Applied Computing, 2005, pp. 901–902, http://dx.doi.org/10.1145/1066677. 1066883. [59] P. Karpiński, J. McDonald, A high-performance portable abstract interface for explicit SIMD vectorization, in: Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM’17, ACM Press, New York, New York, USA, 2017, pp. 21–28, http://dx.doi.org/10.1145/3026937.3026939. [60] C.G. Kim, J.G. Kim, D.H. Lee, Optimizing image processing on multi-core CPUs with intel parallel programming technologies, Multimedia Tools Appl. 68 (2) (2011) 237–251, http://dx.doi.org/10.1007/s11042-011-0906-y. [61] M. Kretz, V. Lindenstruth, Vc: A C++ library for explicit vectorization, Softw. - Pract. Exp. 42 (11) (2012) 1409–1430, http://dx.doi.org/10.1002/spe.1149. [62] O. Krzikalla, G. Zitzlsberger, in: Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing - WPMVP ’16, New York, New York, USA. http://dx.doi.org/10.1145/2870650.2870655. [63] S. Ladra, O. Pedreira, J. Duato, N.R. Brisaboa, Exploiting SIMD instructions in current processors to improve classical string algorithms, in: T. Morzy, T. Härder, R. Wrembel (Eds.), Advances in Databases and Information Systems, in: Lecture Notes in Computer Science, Vol. 7503, 2012, pp. 254–267, http://dx.doi.org/10.1007/978-3-642-33074-2_19. [64] S. Larsen, S. Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets, (Ph.D. thesis), 2000. http://dx.doi.org/10.1145/ 358438.349320. [65] R. Lee, Subword parallelism with MAX-2, IEEE Micro 16 (4) (1996) 51–59, http://dx.doi.org/10.1109/40.526925. [66] R. Leißa, I. Haffner, S. Hack, Sierra: A SIMD extension for C++, in: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, 2014, pp. 17–24, http://dx.doi.org/10.1145/2568058.2568062. [67] G. Lento, Optimizing Performance with Intel Advanced Vector Extensions, White Paper. [68] D. Naishlos, Autovectorization in GCC, in: Proceedings of the 2004 GCC Developers Summit, 2004, pp. 105–118. URL ftp://gcc.gnu.org/pub/gcc/ summit/2004/Autovectorization.pdf. [69] N. Nethercote, J. Seward, Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation, Vol. 42, ACM, 2007, pp. 89–100. [70] D. Nuzman, A. Zaks, Outer-loop vectorization, in: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques - PACT ’08, ACM press, New York, New York, USA, 2008, p. 2, http://dx.doi.org/10.1145/1454115.1454119. [71] OpenMP Architecture Review Board, OpenMP Application Programming Interface, Tech. Rep. (2015). 
[72] Oracle Corporation, x86 Assembly Language Reference Manual, Oracle (817–5477–11). DOI: 817–5477–11. [73] A. Peleg, U. Weiser, MMX technology extension to the intel architecture, IEEE Micro 16 (4) (1996) 42–50, http://dx.doi.org/10.1109/40.526924. [74] S.J. Pennycook, C.J. Hughes, M. Smelyanskiy, S. Jarvis, Exploring SIMD for molecular dynamics, using intel xeon processors and intel xeon phi coprocessors, in: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IEEE, Boston, MA, 2013, pp. 1085–1097, http: //dx.doi.org/10.1109/IPDPS.2013.44. [75] Perf Wiki, perf: Linux Profiling with Performance Counters (2017). URL https://perf.wiki.kernel.org/index.php/Main_Page. [76] M. Pharr, W.R. Mark, ispc: A SPMD Compiler for high-performance CPU programming, in: 2012 Innovative Parallel Computing (InPar), IEEE, 2012, pp. 1–13, http://dx.doi.org/10.1109/InPar.2012.6339601. [77] A. Pohl, B. Cosenza, M.A. Mesa, C.C. Chi, B. Juurlink, An evaluation of current SIMD programming models for C++, in: Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing - WPMVP ’16, 2016, pp. 1–8, http://dx.doi.org/10.1145/2870650.2870653. [78] S. Raman, V. Pentkovski, J. Keshava, Implementing streaming SIMD extensions on the pentium III processor, IEEE Micro 20 (4) (2000) 47–57, http://dx.doi.org/10.1109/40.865866. [79] R. Ramanathan, R. Curry, S. Chennupaty, R.L. Cross, S. Kuo, M.J. Buxton, Extending the World’s Most Popular Processor Architecture, White Paper.


[80] P. Ranganathan, S. Adve, N. Jouppi, Performance of image and video processing with general-purpose processors and media ISA extensions, in: Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367), (003604), IEEE Comput. Soc. Press, 1999, pp. 124–135, http://dx.doi.org/10.1109/ISCA.1999.765945. [81] A.D. Robison, Composable parallel patterns with intel cilk plus, Comput. Sci. Eng. 15 (2) (2013) 66–71, http://dx.doi.org/10.1109/MCSE.2013.21. [82] L. Seiler, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, Larrabee, ACM Trans. Graph. 27 (3) (2008) 1, http: //dx.doi.org/10.1145/1360612.1360617. [83] A. Shahbahrami, Avoiding Conversion and Rearrangement Overhead in SIMD Architectures, (Ph. d. dissertation, Ph. D. dissertation), Computer Engineering Laboratory, Delft University of Technology, Delft, Netherlands, 2008, http://dx.doi.org/10.1007/s10766-006-0015-0. [84] A. Shahbahrami, B. Juurlink, Performance improvement of multimedia kernels by alleviating overhead instructions on SIMD devices, in: International Workshop on Advanced Parallel Processing Technologies, 2009, pp. 389–407. [85] A. Shahbahrami, B. Juurlink, S. Vassiliadis, A comparison between processor architectures for multimedia application, in: Proceedings of the 15th Annual Workshop on Circuits, Systems and Signal Processing, 2004, pp. 138–152. [86] A. Shahbahrami, B. Juurlink, S. Vassiliadis, Efficient vectorization of the FIR filter, in: 16th Annual Workshop on Circuits, Systems and Signal Processing, 2005, pp. 432–437. [87] A. Shahbahrami, B. Juurlink, S. Vassiliadis, Limitations of special-purpose instructions for similarity measurements in media SIMD extensions, in: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems - CASES ’06, ACM Press, New York, New York, USA, 2006, p. 293. [88] N.T. Slingerland, A.J. Smith, Multimedia Instruction Sets for General Purpose Microprocessors: A Survey, Tech. rep. University of California, Berkeley, California, Report No. UCB/CSD-00-1124, 2000. [89] N. Slingerland, A. Smith, Measuring the performance of multimedia instruction sets, IEEE Trans. Comput. 51 (11) (2002) 1317–1332, http://dx. doi.org/10.1109/TC.2002.1047756. R [90] A. Sodani, Knights landing (KNL): 2nd generation intel⃝ xeon phi processor, in: 2015 IEEE Hot Chips 27 Symposium (HCS), IEEE, 2015, pp. 1–24, http://dx.doi.org/10.1109/HOTCHIPS.2015.7477467. [91] D. Talla, L. John, D. Burger, Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements, IEEE Trans. Comput. 52 (8) (2003) 1015–1031, http://dx.doi.org/10.1109/TC.2003.1223637. [92] X. Tian, H. Saito, S.V. Preis, E.N. Garcia, S.S. Kozhukhov, M. Masten, A.G. Cherkasov, N. Panchenko, Effective SIMD vectorization for intel xeon phi coprocessors, Sci. Program. 2015 (2015) 1–14, http://dx.doi.org/10.1155/ 2015/269764. [93] Niraj J. Tiwari, Pujashree S. Vidap, Vectorization on intel xeon phi: A survey approach, Int. J. Sci. Technol. Eng. 2 (10) (2016) 855–858, URL http://www.ijste.org/articles/IJSTEV2I10255pdf.

[94] M. Tremblay, J. O’Connor, V. Narayanan, Liang He, VIS speeds new media processing, IEEE Micro 16 (4) (1996) 10–20, http://dx.doi.org/10.1109/40. 526921. [95] H. Wang, P. Wu, I.G. Tanase, M.J. Serrano, J.E. Moreira, Simple, portable and fast SIMD intrinsic programming: Generic SIMD library, in: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, 2014, pp. 9–16, http://dx.doi.org/10.1145/2568058.2568059. [96] M. Youssfi, O. Bouattane, M.O. Bensalah, A massively parallel virtual machine for SIMD architectures, Adv. Stud. Theor. Phys. 9 (5) (2015) 237–243, http://dx.doi.org/10.12988/astp.2015.519. [97] A.S. Zekri, Enhancing the matrix transpose operation using intel avx instruction set extension, Int. J. Comput. Sci. Inf. Technol. 6 (3) (2014) 67–78, http://dx.doi.org/10.5121/ijcsit.2014.6305. [98] H. Zhou, J. Xue, Exploiting mixed SIMD parallelism by reducing data reorganization overhead, in: Proceedings of the 2016 International Symposium on Code Generation and Optimization - CGO 2016, ACM Press, New York, New York, USA, 2016, pp. 59–69, http://dx.doi.org/10.1145/2854038. 2854054.

Hossein Amiri received the B.Sc. degree in computer engineering in Iran and is currently an M.Sc. student in computer engineering under the supervision of Dr. Asadollah Shahbahrami in the Department of Computer Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran. [email protected].

Asadollah Shahbahrami received the B.Sc. and M.Sc. degrees in computer engineering from Iran University of Science and Technology and Shiraz University in Iran, respectively. He received his Ph.D. degree in computer engineering in 2008 from Delft University of Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, Computer Engineering Laboratory, under the supervision of Prof. Stamatis Vassiliadis and Ben Juurlink, in the Netherlands. He is currently an associate professor in the Department of Computer Engineering, Faculty of Engineering, University of Guilan, Rasht, Iran. His research interests include Advanced Computer Architectures, SIMD Programming, Reconfigurable Architectures, Multimedia Processors, Content-based Image and Video Retrieval, Multimedia Instruction Set Design, Multimedia Database Management, Embedded Systems Design, and Digital Image and Video Processing. [email protected].