Low Latency and Division Free Gauss-Jordan Solver in Floating Point Arithmetic


Highlights

• Low latency solvers are necessary for real time applications (simulation/control).
• The divider circuits used in most previous works induce long latencies.
• We propose a division free parallel architecture adapted to floating point arithmetic.
• We obtain two orders of magnitude gains compared to previous works.
• 100-equation systems can be solved under 10 microseconds.


Low Latency and Division Free Gauss-Jordan Solver in Floating Point Arithmetic

Jean Pierre David
Polytechnique Montréal

Abstract

In many applications, the solution of a linear system is computed with Gaussian elimination followed by back-substitution, or with Gauss-Jordan elimination. The latter is intrinsically more parallel, enabling smaller computing latencies at the price of more complex hardware. However, both methods require the division operator, which places a long-latency resource in the critical path of the algorithms and increases the overall processing latency. Jordan was already aware of a division free algorithm. However, its implementation involves multiplications at each step, and the magnitude of the numbers rapidly becomes too large for an efficient implementation of large systems. In this work, we present a small modification of the division free algorithm that keeps the magnitude of the numbers in a reasonable range for standard floating point formats. This is possible thanks to the special format of floating point numbers, which enables error free and hardware efficient divisions by powers of two. We also propose a parallel and pipelined architecture that best exploits the proposed algorithm, including partial pivoting. We specifically focus on the global latency of the system as a function of its size, the latency of the floating point operators, and the number of operators that are available. Results demonstrate that current FPGAs can solve linear systems of more than one hundred equations within ten microseconds. This represents a two order of magnitude improvement over previous implementations for relatively small systems.

Keywords: Gaussian Elimination, Gauss-Jordan, Linear Solver, Low Latency Solver, Real Time Simulation

Corresponding author: Jean Pierre David, C.P. 6079, succursale Centre-ville, Montréal, QC, Canada, H3C 3A7, Tel +1 514 340 4711 p2009, Email: [email protected]

1. Introduction

Solving a system of linear equations Ax = b, where A is a matrix of size N × N and b is a vector of size N, is a problem that was already addressed in ancient China, in the first century BC. A method similar to Gaussian elimination was known at that time and described in a manuscript named "The Nine Chapters" (Shen et al., 1999). The method is still in use today in many applications, under various implementations and with some optimizations, including recent works such as (Chowdhury et al., 2013) and (Faverge et al., 2015).

Most approaches are founded on Gaussian elimination, LU factorization, and the Gauss-Jordan algorithm. Gaussian elimination minimizes the total number of operations for a given system Ax = b. LU factorization can be seen as an evolution of Gaussian elimination because the computation of the matrices L (lower triangular) and U (upper triangular) such that A = LU is similar to Gaussian elimination. However, once L and U are computed, solving the system LUx = b only requires a forward substitution followed by a backward substitution. The LU method thus minimizes the number of computations when the system Ax = b must be solved for different vectors b and a constant matrix A.

To our knowledge, the Gauss-Jordan algorithm (and its further optimizations) is the most parallel approach to solve a system Ax = b. It requires more operations than the previous methods but it offers the best latency. Latency is an important issue for real time simulation systems (Chen and Dinavahi, 2011), sometimes used in the context of a Hardware In the Loop (HIL) environment (Arias-Garcia et al., 2013) (a system of six equations), (Ould-Bachir et al., 2012) (a system of fourteen equations). Such applications are generally based on a system of equations to solve at each time step, with time steps typically in the order of ten microseconds. In some applications, the time step can even be as low as a single microsecond, or less, which is hardly achievable with CPU-based solutions (Ould-Bachir et al., 2015).

When the linear system is known in advance, it is possible to pre-compute the inverse matrix (Dufour et al., 2012), leading to straightforward hardware architectures involving a simple matrix product. In some cases, iterative methods can also be applied (Prasanna and Morris, 2007). Such methods use the latest known solution to converge towards a new solution, taking the recent small changes into account. Nevertheless, in the general case, these methods cannot be applied to real time low latency applications.

FPGAs are well known to enable the implementation of massively parallel architectures in fixed-point arithmetic, making them excellent candidates for DSP algorithms. Since the beginning of the century, FPGAs can also compete with standard processors for best performances in single precision applications (Lienhart et al., 2002), and later in double precision applications (Govindu et al., 2005; Zhuo and Prasanna, 2008). This last decade, the number of dedicated arithmetic blocks available in FPGAs has grown impressively. FPGA manufacturers ship circuits with thousands of fixed-point embedded multipliers and millions of logic blocks that can be assembled together to build hundreds of double precision floating point operators. Today, FPGAs are really interesting targets for parallel applications requiring floating point arithmetic.

Most FPGA-based floating point operators have a highly pipelined architecture because this is the best way to maximize their efficiency. Nevertheless, this is an important obstacle in the presence of operations with strong dependencies, since the next operation can only start when the previous one has passed through all the pipeline stages. Moreover, the division operator, which is required by the previously cited methods, cannot benefit straightforwardly from the dedicated arithmetic blocks, which results in very long pipelines. Such an operator can require up to 61 stages, compared to 14 stages for an adder and 10 stages for a multiplier in double precision arithmetic (Altera, 2014).

In the general case, the Gauss-Jordan algorithm requires N steps and a pivot row must be selected at each step. More details are given in Section 3. Pivoting lowers the residual of the solution. In some cases, Gaussian elimination simply does not work when pivoting is not supported because it would lead to a divide-by-zero error. Nevertheless, in some cases (e.g. diagonally dominant systems), it can be proven that pivoting is not mandatory.

This paper extends a preliminary version of the work published in (David, 2015). At that time, pivoting was not implemented and only partial compilation results were presented. This work fully describes the proposed method to eliminate the need for divisions during the Gauss-Jordan process and proposes the following new contributions:

1. A complete architecture based on the division free algorithm, with pivoting, that exploits both pipelining and parallelism.
2. An algorithm to evaluate the total latency of the proposed solver when pivoting is implemented.
3. Synthesis results of complete systems (4 to 100 equations) with pivoting.

Without pivoting, results show that an 18-equation system can be solved within a microsecond in single precision, while 100-equation systems can be solved within ten microseconds in single or double precision. Pivoting increases the latency of the critical path. Nevertheless, it is still possible to solve 100-equation systems within ten microseconds in single precision.

The paper is organized as follows. Section 2 reports and discusses previous works related to linear system solvers on FPGA in floating point arithmetic. Processor arrays and division free implementations are also reported. Section 3 describes the basics of the Gauss-Jordan algorithm and details a previous work (Martinez-Alonso et al., 2010) close to the proposed method. Sections 4 and 5 respectively present the proposed algorithm and architecture. Sections 6 and 7 evaluate and discuss the performance of the proposed method in terms of FPGA resources and global latency. Section 8 concludes this work.

2. Previous work

2.1. FPGA implementations

Linear solvers, and more generally linear algebra, are the foundation of many applications. As soon as floating point arithmetic became efficient on FPGAs, several works were reported in these fields. In 2004, Johnson et al. proposed an FPGA implementation of the LU algorithm (Johnson et al., 2004), while Böhm and Hammes proposed an FPGA implementation of the Gauss-Seidel algorithm (Böhm and Hammes, 2004). Both works are based on single precision operators. The same year, Daga et al. proposed an implementation of the LU algorithm by blocks in single or double precision arithmetic (Daga et al., 2004). Up to ten processing elements can run in parallel in a single FPGA. Each one is based on deeply pipelined floating point units, including a 32-stage divider that is in the critical path of the algorithm.

In 2006, de Matos and Neto reported a Gauss-Jordan implementation with a speedup of 3 compared to a high-end processor for matrices of orders up to 1700 (de Matos and Neto, 2006). It is interesting to note that they use pipelined single precision floating point operators with 18 stages for multiplications/subtractions and 54 stages for division. The next year, the same authors proposed a memory-optimized architecture with a speedup of 2 compared to the previous one (de Matos and Neto, 2007).

In 2009, an approach to invert a matrix with the Gauss-Jordan algorithm was proposed in (Duarte et al., 2009). The architecture is based on double precision Row Processing Units (RPU) that include a custom designed 64-stage pipelined data path. 29 cycles out of the 64 are required by the division. Results show that a Virtex XC5VSX240T FPGA can contain up to 87 RPUs and process 510 × 510 matrices.

In 2012, Zhang et al. proposed a portable and scalable architecture to implement the LU factorization on FPGA (Zhang et al., 2012). They report speedups of 2.2 and 1.7 for the processing of 10000 × 10000 matrices in single precision arithmetic and double precision arithmetic respectively. In order to minimize the use of dividers, the authors use a single reciprocal unit followed by many multipliers running in parallel. The same year, an architecture was proposed to invert a matrix with the Gauss-Jordan algorithm in single, 40-bit and double precision (Arias Garcia et al., 2012b). The authors also measure the error propagation due to floating point arithmetic. In (Arias Garcia et al., 2012a), the same authors propose an implementation of Gaussian elimination in single precision arithmetic.

In 2013, Moussa et al. proposed an implementation of the Gauss-Jordan algorithm in single precision arithmetic for complex numbers (Moussa et al., 2013). The emphasis is put on the architecture of operators that handle complex numbers. The same year, Chowdhury et al. presented multicore-oblivious algorithms, including the Gaussian elimination algorithm (Chowdhury et al., 2013). In 2015, Ziad et al. proposed a pipelined architecture in single precision arithmetic for solving dense linear systems with reasonable power and area overheads (Ziad et al., 2015). The same year, Faverge et al. proposed a way to combine the stability of QR and the efficiency of LU factorizations to design high-performance dense linear algebra solvers (Faverge et al., 2015).

2.2. Processor arrays and division free implementations

The concept of a division free version of the Gaussian elimination algorithm was already known in 1968, when Bareiss proposed a two-step optimization of this method (Bareiss, 1968). Actually, Bareiss mentions that the one-step division free method was already known by Jordan himself. In 1993, Kirsch and Turner proposed a scaling technique which reduces the number of divisions; nevertheless, the scalings affect the precision (Kirsch and Turner, 1993, 1994). In 1996, Peng et al. proposed an iterative version of the two-step method and a 2D array processor architecture suitable for VLSI implementation (Peng et al., 1996), both inspired by Bareiss' work. The same year, Bermond et al. proposed a parallel implementation of standard Gaussian elimination on a systolic array (Bermond et al., 1996). In 1997, Bhuvaneswari et al. proposed a systolic array implementation of the successive Gaussian elimination with 2N processing elements.

In 2010, Martinez-Alonso et al. proposed an FPGA implementation of the one-step division free method with an array of processors supporting single or double precision arithmetic (Martinez-Alonso et al., 2010). Nevertheless, the authors do not mention the known problem of the increasing magnitude of the data. Since data are multiplied at each step, their magnitude grows exponentially, leading to numbers that cannot be handled efficiently.

The architecture proposed here can be seen as an evolution of these architectures. However, we fix the problem of the magnitude, we take into account the pipelined architecture of floating point operators to optimize the performance, and we support pivoting. More details are given in the next sections.

3. Foundation

In this section, we detail the Gauss-Jordan algorithm, which is the foundation of the current work. We also detail the division-free algorithm, mentioned by Bareiss as already known by Jordan (Bareiss, 1968), and a recent architecture inspired by that work: the division-free parallel architecture (Martinez-Alonso et al., 2010). Their aim is to solve the system Ax = b, where A is an n × n matrix and x, b are n-element vectors. Both the Gauss-Jordan algorithm and the division-free parallel architecture are n-step algorithms aiming at simplifying the matrix A and the vector b such that, at the final step, A is a diagonal matrix (the identity in the case of the Gauss-Jordan algorithm), leading to a trivial solution for Ax = b. In the following, matrix C is the concatenation of A and b, C = A|b. The content of matrix C at row i and column j after step k is noted Ck(i, j). The ith row of C after step k is noted Ck(i, ...). C0 is the initial content of C.

3.1. Gauss-Jordan algorithm

The Gauss-Jordan algorithm is founded on the following property: replacing a row of C by a linear combination of itself (with a non-zero coefficient) and other rows of C does not affect the solution of the system. Each step k is dedicated to simplifying column k so that, at the end of the step, all the elements of column k are zero, except the diagonal element C(k, k), which has become one. When pivoting is supported, each step starts with a row swapping to maximize the absolute value of C(k, k). Pivoting lowers the residual of the solution and also fixes the problem of a null value for C(k, k). The whole process is described in Algorithm 1.
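For illustration, the following minimal NumPy sketch mirrors Algorithm 1 in software (it follows the notation above; it is not the hardware datapath discussed later). The elimination for all rows i ≠ k uses the pivot row before it is normalized:

    import numpy as np

    def gauss_jordan(A, b):
        """Software sketch of Algorithm 1 (Gauss-Jordan with partial pivoting)."""
        n = len(b)
        C = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
        for k in range(n):
            m = k + np.argmax(np.abs(C[k:, k]))    # pivot row: maximize |C(k, k)|
            C[[k, m]] = C[[m, k]]                  # swap rows k and m
            M = C[k, k]
            for i in range(n):
                if i != k:
                    C[i] -= (C[i, k] / M) * C[k]   # make C(i, k) zero
            C[k] /= M                              # make C(k, k) one
        return C[:, n]                             # A is now the identity: x is the last column

A quick check such as np.allclose(A @ gauss_jordan(A, b), b) on a random system exercises the sketch.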

It must be noted that, since after step k the columns 1 to k are filled with zeros except for the diagonal elements, which are ones, the number of computations actually processed in the innermost loop of Algorithm 1 can be reduced. Moreover, all the computations in step k can be processed in parallel since they only depend on the values computed in step k − 1.

3.2. The division-free algorithm

The division-free algorithm (DFA) eliminates the need for divisions until the very last step, where A is diagonal. This is particularly interesting for FPGA implementation because divisions are expensive in terms of logic gates and latency. Actually, FPGAs have dedicated logic to compute multiplications, additions and subtractions, but not divisions. Eliminating divisions enables much more efficient FPGA implementations. The DFA exploits the fact that a row of C can be multiplied by a non-zero value without affecting the solution of Ax = b. We propose a slightly different illustration of the algorithm presented in (Bareiss, 1968) and used in (Martinez-Alonso et al., 2010), to better illustrate the relation with the Gauss-Jordan algorithm. Actually, the DFA described in Algorithm 2 is exactly the same as Algorithm 1 where each row is multiplied by M. As with the Gauss-Jordan algorithm, all the computations can be processed in parallel. In (Martinez-Alonso et al., 2010), an array of processors is used to compute each step. One processor is associated to each element of C, and the processors can forward the data to their neighbors in order to initialize the array, propagate the multiplicative constants and get the results back.

However, multiplying each row by M at each step has an important side effect, since the magnitude of M increases at each step too. Actually, the order of magnitude of each element of C is squared at each step. In fixed-point arithmetic, this means that the bit-length of each element is doubled at each step. The method is thus not applicable to fixed-point arithmetic. In floating-point arithmetic, it is the magnitude of the exponents that is doubled at each step, which limits the application of the method to small systems. The authors do not mention this problem.

There is another constraint related to logic design, especially in FPGA implementations: complex operators must be pipelined in order to efficiently exploit their computational power. Floating point arithmetic easily leads to data paths with tens of pipeline stages. If a pipelined circuit is used to compute a single evaluation over the whole step, the circuit's efficiency is only a small fraction of its potential maximum. Few details are given about the architecture in (Martinez-Alonso et al., 2010), but since a dedicated processor is used for each element of C, it seems that the pipeline is not exploited.

4. Proposed algorithm and pipeline opportunity

4.1. Proposed algorithm

The proposed algorithm is illustrated in Algorithm 3. The only difference with the DFA is the division by the power-of-two estimate of M. In floating point arithmetic, this estimate is simply the exponent field of the number, and dividing by a power of two can be implemented with a simple subtraction applied to the exponent field. This is one of the main contributions of the present paper: normalizing the exponents only requires a single pipeline stage and very few logic resources in an FPGA, and it prevents the rows from increasing in magnitude as in the DFA.

Taking advantage of the pipeline requires an analysis of the data dependencies between two steps, since we already know that there are no data dependencies inside a step. It appears that all the computations involve M and L, which are values from row k and row i, computed at the previous step. Moreover, at the end of step k, all the elements of column k are null but Ck(k, k). So, at step k, we should only compute Ck(i, j) where j > k and update the diagonal elements Ck(i, i) for i < k. A practical approach to implement this optimization is illustrated in Algorithm 4, where the diagonal elements are actually stored in an extra column at the end of matrix C.

4.2. Pipeline opportunity

A simple example will best illustrate the opportunity offered by pipelined operators. In our example, coherent with the floating point operators proposed by Altera, a multiplication requires 5 pipeline stages, an addition requires 7 pipeline stages, and the division by a power of 2 requires a single pipeline stage. The heart of Algorithm 4 is the computation processed in lines 16 and 17, which we will name PMSS (Parallel Multiplication, Subtraction, Scaling). It is illustrated in Figure 1.

[Figure 1: The Parallel Multiplication, Subtraction, Scaling (PMSS) operator — two parallel pipelined multipliers (one by M, one by L) feeding a subtractor, followed by the 1/|M| power-of-two scaling stage.]

The total latency of the PMSS operator is 13 cycles, since the two multiplications are computed in parallel. Let us consider a matrix C with n = 20. If we use a different PMSS operator for each element of C, as in (Martinez-Alonso et al., 2010), we need 20 × 20 = 400 PMSS operators. For now, we take neither the communication delay nor the pivoting delay into account, so each step only requires 13 clock cycles and the total computation requires 20 steps × 13 = 260 clock cycles. However, only one stage out of thirteen of each PMSS operator is actually processing useful data at each clock cycle, resulting in only 7.7% efficiency.

Let us now consider the case where a single PMSS operator is used for each row. Algorithm 4 first computes the M and L values that will be needed at the next step. At step 1, all C0(i, n + 2) are zeros, so we only need 20 clock cycles to feed the PMSS operators and compute the values C1(i, 2 ... 21). The M and L values required for step 2 will already be available after 13 clock cycles, so step 2 can start after 20 clock cycles. Step 2 will also require 20 clock cycles to compute the values C2(i, 3 ... 22). Starting at step 3, the number of operations to compute per step is decremented at each step: step 3 will require 19 clock cycles, step 4 will require 18 clock cycles, and so on, down to 14 clock cycles; all the following steps also require 14 cycles, since this is the minimum time to compute the next M and L values and to get the next element from the master and slave PMSS operators. Finally, step 20 will feed the PMSS operators with only 2 data and the last result will be obtained after 14 clock cycles. As a result, the whole algorithm requires 307 clock cycles (an increase of 18% in time) with only 20 PMSS operators (only 5% of the 400 PMSS operators in the previous configuration).
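The cycle counts of this example can be checked with a short script. The sketch below encodes the counting argument above (one PMSS activation per remaining column, a new step starting as soon as the next M and L are available, plus one cycle to forward them); pivoting and routing delays are ignored, as in the text:

    import math

    def cycles_per_row_operators(n, lpo, p):
        """Total clock cycles with p pipelined PMSS operators per row."""
        cycles = max(lpo + 1, math.ceil(n / p))                 # step 1: feed n elements
        for k in range(2, n):                                   # steps 2 .. n-1
            cycles += max(lpo + 1, math.ceil((n + 2 - k) / p))
        cycles += lpo + 1 if p == 1 else lpo                    # last step: 2 activations
        return cycles

    print(20 * 13)                              # 260: one PMSS operator per element of C
    print(cycles_per_row_operators(20, 13, 1))  # 307: one PMSS operator per row
    print(cycles_per_row_operators(20, 13, 2))  # 279: two PMSS operators per row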

An intermediate configuration would use 2 PMSS operators for each row, such that each step would only require 14 clock cycles (13 for the last one). In this configuration, we now have 279 clock cycles (an increase of 7% in time) but only 40 PMSS operators, a reduction of 90% of the logic.

This example has demonstrated the potential of pipelined resources to implement the proposed algorithm. The next section details the proposed architecture that can fully exploit such resources.

5. Proposed architecture

We propose an architecture based on the PMSS operator embedded inside a Row Processing Element (RPE), which possibly also contains a local memory. The global architecture is illustrated in Figure 2.

[Figure 2: Global view of the proposed architecture — n RPEs, each built around a PMSS operator with local memory or direct routing, exchanging M and C(k, j) over global buses and feeding the pivot selector.]

Each row is processed by a different RPE. At step k, all the RPEs send the value of their column k to the pivot selector. The largest value is selected as the next M and the associated row becomes the pivot row. A possible implementation is a binary tree of comparators. However, rows are not actually swapped: the RPE associated to the pivot row simply becomes the master and all the other RPEs become slaves, with their associated L. Only the RPEs that have never been master are eligible to become a master.

The role of the master RPE is to send its data, column by column, to all the slaves in parallel through two global buses. It does not make any computation, since row k is not modified at step k. M has already been sent to determine the master. Only the elements k + 1 ... n + 1 are sent, by order of increasing column index.

The role of the slave RPE associated to row i is to compute the PMSS operation, column by column, as soon as the data arrive from the master RPE. L has already been sent to determine the master. Only the elements k + 1 ... n + 1 are processed, by order of increasing column index, simultaneously with the master elements. Depending on the number of operations to compute and the latency of the PMSS operator, the results can be stored locally or sent straightforwardly to the next step.

6. Results

This section analyzes the number of clock cycles required by the proposed architecture to solve a system Ax = b, depending on the size of the system and the latency of the PMSS operator. The total time required to compute the solution is the number of clock cycles divided by the frequency of the system, which mainly depends on the architecture of the PMSS operator.

It is not the purpose of this work to propose the best implementation of the PMSS operator, since previous works have already addressed the question of floating point fused operators and the performances are highly dependent on the target FPGA. Nevertheless, we present in this section some synthesis results of different configurations of the PMSS operator in order to give an insight into what is achievable with standard components from the FPGA manufacturer's library.

6.1. Latency

Considering the architecture proposed in Figure 2, we also evaluate the opportunity of using P PMSS operators running in parallel inside each RPE, to best address the case of large systems. Let us call Lpo the total latency of the PMSS operator, Lpivot the latency to find the pivot row, and Lrouting the latency of the routing to send the data in parallel to all the slave RPEs (the access to the bus is actually implemented as a large multiplexer, which may require one or several additional clock cycles, depending on the size of the system).

The total number of PMSS activations in an RPE at step k is N for the first step and N + 2 − k for the next steps, since the null values don't need to be recomputed. When P > 1, the total number of activations per PMSS instance must be divided by P (rounded up). The last step has only two PMSS activations, so the number of clock cycles required by the last step is Lpo + 1 when P = 1 and Lpo as soon as P > 1.

These considerations are the foundation of Algorithm 5, which computes the total number of clock cycles from N, Lpivot, Lrouting, Lpo and P. In the first step, we take into account the computation of the first pivot row, the sending of all the data to the PMSS operator, and the computation of the second pivot row. It must be noted that the calculation of the pivot row can start as soon as the first data is output from the PMSS operator. Moreover, Lpivot should never be null, because it takes at least one more clock cycle to receive the second data from the PMSS operator and be ready to start a new step, except for step one.

The results presented in (David, 2015) did not take the pivoting into account. In that case, Lpivot = 1 (as mentioned previously) and Lpo includes the latency of the routing. A condensed version of these results is presented in Table 1 (P=1), Table 2 (P=2) and Table 3 (P=4). When pivoting is used, its latency can be estimated as Lpivot = ⌈log2(n)⌉, since we propose to use a binary tree of comparators. The latency of the routing is more difficult to estimate because it depends on the target technology. It can be implemented by a bus or a multiplexer of size n. In this paper, we consider Lrouting = ⌈log10(n)⌉. Results are presented in Table 4 (P=1), Table 5 (P=2) and Table 6 (P=4).

6.2. FPGA resources

The results presented in this section have been computed with the Quartus II software 64-bit (Altera), version 15.0.0 build 145. We chose the FPGA Stratix V 5SGSMD8N1F45C2, which has the highest number of DSP blocks (1963) and 262400 ALMs. We use the default synthesis options (Balanced - Normal flow). Reported frequencies have been computed with the Slow 900mV 85C model (post place and route).

The floating point multiplier altfp_mult, proposed by Altera, can be synthesized with 5, 6, 10 or 11 pipeline stages, with custom sizes for the exponent and mantissa fields. We have synthesized both the single precision format (8-bit exponent, 23-bit mantissa) and the double precision format (11-bit exponent, 52-bit mantissa). Synthesis results are reported in Table 7. The floating point subtractor altfp_add_sub, proposed by Altera, can be synthesized with 7 to 14 pipeline stages. A few test case synthesis results are reported in Table 8.

We define the configuration format of the PMSS operator as <multiplier stages>-<subtractor stages>. The synthesis results obtained suggest to try the following configuration formats: 5-7 and 5-12 in single precision, and 5-7, 6-7 and 10-7 in double precision. This way, we minimize the latency for a given frequency. It must be understood that, when we combine two operators, only the lowest frequency is relevant. Synthesis results of the PMSS operator are reported in Table 9. These latest results suggest to finally keep the single precision formats 5-7 and 5-12 and the double precision format 5-7 for the PMSS operator.

It appears that the results presented in (David, 2015) contain a flaw for the configuration 5-12 (18). Due to an error in the collection of results, the latency used for that configuration (Tables VII, VIII and IX in said document) was 28 instead of 18, which resulted in a global latency higher than the actual one for that configuration. Table 10 presents a condensed view of the total latency without pivoting as it should have appeared in (David, 2015). Table 11 shows the total latency of the proposed architecture with pivoting, in the conditions already mentioned. Both tables are founded on the frequencies obtained in Table 9.

6.3. A complete system implementation

In order to validate the proposed architecture on a complete system, we have implemented a generic solver in VHDL and synthesized it for several sizes n ranging from 4 to 100. We use the same FPGA Stratix V 5SGSMD8N1F45C2 and the same synthesis options as reported at the beginning of this section. Each RPE has a single 5-7 PMSS operator (single precision, P = 1, Lpo = 13). The pivot selector is a generic binary tree of comparators. However, the comparators only compare the exponent fields, since they adequately represent the amplitude of the numbers.

Our aim here is simply to validate that the proposed Algorithm 5 can accurately compute the total latency and that the proposed architecture can really compute the solution of a system Ax = b. In this context, the proposed implementation has not been fully optimized: the latency of the pivot finder could be lowered, the routing latency could better fit the size of the system, and the latency of the PMSS operators could be lowered too by using fused operators. The synthesis results of the complete system are reported in Table 12.

The architecture has been validated at RTL for n = 10 and n = 20. We have used random matrices A and b, and the results are perfectly coherent with a high level model of the algorithm written in Java. The number of clock cycles computed by Algorithm 5 has also been confirmed by the simulation.

7. Discussion

The proposed algorithm and architecture have been designed to minimize the latency as much as possible, which makes them particularly useful for small systems that must compute a solution in real time. For large systems such as the ones targeted in (Zhang et al., 2012), the latency of the floating point operators is completely absorbed by the length of the rows. This can be observed in Figure 8 of said paper; the performance falls to zero when the size of the system becomes too low (compared to thousands of equations).

Previous works don't give many details on the total latency for small systems. In (Arias Garcia et al., 2012a), Figure 4 illustrates a latency varying from approximately 100 µs to 1400 µs when n varies from 10 to 100 (single precision arithmetic). Our results show a total latency varying between 0.543 µs and 6.534 µs in the same conditions, which is more than a two order of magnitude improvement.

It is always difficult and hazardous to compare the performances of different FPGA implementations, since many parameters may differ. This is why we have applied a conservative methodology to evaluate our performances, with standard floating point resources and standard synthesis options. We are confident that a fused PMSS operator would reduce the number of pipeline stages, and a fine tuning of the place and route process may also lead to better timings. Altera recently proposed the Arria 10 and Stratix 10 series, which embed hardened floating-point DSP blocks. In this context, the latency of the PMSS operator could be as low as four clock cycles for single precision arithmetic. Nevertheless, the contributions of this paper are the proposed algorithm and architecture, not the estimation of a particular implementation's performance. The results presented in this paper enable a better comparison with previous works.

Compared to the results presented in (David, 2015), the implementation of the pivoting increases the total latency from 2.5% (single precision, n=100, P=1, Lpo = 13) to 50% (single precision, n=40, P=4, Lpo = 13), depending on the configuration. The more the pivoting is in the critical path, the higher the latency overhead. Nevertheless, it is still possible to solve 100-equation systems in single precision within ten microseconds.

It must also be noted that the final result is a diagonal matrix. A last division step may be required if the final values are needed. Such a resource can be shared by several or all the RPEs. The ALTFP_DIV divider proposed by Altera requires 33 pipeline stages in single precision arithmetic and 61 stages in double precision arithmetic.

Evaluating the benefit of the proposed solution on the precision is a difficult task. Floating point addition is not associative, as illustrated by the following operations in double precision format:

10^-50 + (-10^-50) + 1 + (-1) = ?
(10^-50 + (-10^-50)) + (1 + (-1)) = 0
10^-50 + (((-10^-50) + 1) + (-1)) = 10^-50
(-10^-50) + ((10^-50 + 1) + (-1)) = -10^-50

In floating point, a given system of equations can thus lead to different solutions by simply swapping two variables. Division-free algorithms theoretically enhance the precision since they avoid the division round-off error and thus increase the numerical stability of the algorithm (Peng et al., 1996). The proposed algorithm replaces the divisions of the Gauss-Jordan algorithm by error-free scalings. We would have appreciated measuring an increase of the precision. Nevertheless, in our experiments with double precision arithmetic on random systems, we have not been able to observe a significant difference between the precision of the proposed approach and the precision of the standard Gauss-Jordan algorithm.

8. Conclusion

We have proposed a division free algorithm and a corresponding parallel and pipelined architecture, based on the well-known Gauss-Jordan algorithm, to solve linear systems at low latency. Results show a two order of magnitude improvement compared to previous works, since a hundred-equation system can be solved in less than ten microseconds with a state of the art FPGA in single or double precision arithmetic. We have also proposed an improvement of the architecture to support pivoting. When pivoting is enabled, the latency overhead varies from 2.5% to 50%, depending on the configuration of the architecture and the size of the system. This work confirms that, today, FPGAs are powerful targets for highly parallel applications involving floating point arithmetic in single or double precision.

References

Altera, 2014. Floating-Point IP Cores User Guide. URL https://www.altera.com

Arias-Garcia, J., Braga, A., Llanos, C., Ayala-Rincon, M., Pezzuol Jacobi, R., Foltran, A., Feb 2013. FPGA HIL simulation of a linear system block for strongly coupled system applications. In: Industrial Technology (ICIT), 2013 IEEE International Conference on. pp. 1017–1022.

Arias Garcia, J., Llanos, C., Ayala Rincon, M., Jacobi, R., Feb 2012a. A fast and low cost architecture developed in FPGAs for solving systems of linear equations. In: Circuits and Systems (LASCAS), 2012 IEEE Third Latin American Symposium on. pp. 1–4.

Arias Garcia, J., Llanos, C., Ayala Rincon, M., Jacobi, R., March 2012b. FPGA implementation of large-scale matrix inversion using single, double and custom floating-point precision. In: Programmable Logic (SPL), 2012 VIII Southern Conference on. pp. 1–6.

Bareiss, E. H., 1968. Sylvester's identity and multistep integer-preserving Gaussian elimination. Mathematics of Computation 22 (103), 565–578.

Bermond, J.-C., Peyrat, C., Sakho, I., Tchuenté, M., Feb. 1996. Parallelization of the Gaussian elimination algorithm on systolic arrays. J. Parallel Distrib. Comput. 33 (1), 69–75. URL http://dx.doi.org/10.1006/jpdc.1996.0025

Böhm, W., Hammes, J., 2004. A transformational approach to high performance embedded computing. Tech. rep., DTIC Document.

Chen, Y., Dinavahi, V., July 2011. An iterative real-time nonlinear electromagnetic transient solver on FPGA. In: Power and Energy Society General Meeting, 2011 IEEE. pp. 1–8.

Chowdhury, R. A., Ramachandran, V., Silvestri, F., Blakeley, B., 2013. Oblivious algorithms for multicores and networks of processors. Journal of Parallel and Distributed Computing 73 (7), 911–925.

Daga, V., Govindu, G., Prasanna, V., Gangadharpalli, S., Sridhar, V., 2004. Efficient floating-point based block LU decomposition on FPGAs. In: Proceedings of the 2004 International Conference on Engineering Reconfigurable Systems and Algorithms. pp. 276–279.

David, J. P., Dec 2015. Low latency solver for linear equation systems in floating point arithmetic. In: 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig). pp. 1–7.

de Matos, G., Neto, H., Aug 2006. On reconfigurable architectures for efficient matrix inversion. In: Field Programmable Logic and Applications, 2006. FPL '06. International Conference on. pp. 1–6.

de Matos, G., Neto, H., Feb 2007. Memory optimized architecture for efficient Gauss-Jordan matrix inversion. In: Programmable Logic, 2007. SPL '07. 2007 3rd Southern Conference on. pp. 33–38.

Duarte, R., Neto, H., Vestias, M., Aug 2009. Double-precision Gauss-Jordan algorithm with partial pivoting on FPGAs. In: Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on. pp. 273–280.

Dufour, C., Cense, S., Ould-Bachir, T., Gregoire, L., Belanger, J., Oct 2012. General-purpose reconfigurable low-latency electric circuit and motor drive solver on FPGA. In: IECON 2012 - 38th Annual Conference on IEEE Industrial Electronics Society. pp. 3073–3081.

Faverge, M., Herrmann, J., Langou, J., Lowery, B., Robert, Y., Dongarra, J., 2015. Mixing LU and QR factorization algorithms to design high-performance dense linear algebra solvers. Journal of Parallel and Distributed Computing 85, 32–46.

Govindu, G., Scrofano, R., Prasanna, V. K., 2005. A library of parameterizable floating-point cores for FPGAs and their application to scientific computing. In: Proc. of International Conference on Engineering Reconfigurable Systems and Algorithms. pp. 137–148.

Johnson, J., Nagvajara, P., Nwankpa, C., 2004. High-performance linear algebra processor using FPGA. Tech. rep., DTIC Document.

Kirsch, B. J., Turner, P. R., Jun 1993. Adaptive beamforming using RNS arithmetic. In: Computer Arithmetic, 1993. Proceedings., 11th Symposium on. pp. 36–43.

Kirsch, B. J., Turner, P. R., 1994. Modified Gaussian elimination for adaptive beam forming using RNS arithmetic. Tech. rep., DTIC Document.

Lienhart, G., Kugel, A., Manner, R., 2002. Using floating-point arithmetic on FPGAs to accelerate scientific N-body simulations. In: Field-Programmable Custom Computing Machines, 2002. Proceedings. 10th Annual IEEE Symposium on. pp. 182–191.

Martinez-Alonso, R., Mino, K., Torres-Lucio, D., Sept 2010. Array processors designed with VHDL for solution of linear equation systems implemented in a FPGA. In: Electronics, Robotics and Automotive Mechanics Conference (CERMA), 2010. pp. 731–736.

Moussa, S., Razik, A., Dahmane, A., Hamam, H., May 2013. FPGA implementation of floating-point complex matrix inversion based on Gauss-Jordan elimination. In: Electrical and Computer Engineering (CCECE), 2013 26th Annual IEEE Canadian Conference on. pp. 1–4.

Ould-Bachir, T., Blanchette, H. F., Al-Haddad, K., June 2015. A network tearing technique for FPGA-based real-time simulation of power converters. IEEE Transactions on Industrial Electronics 62 (6), 3409–3418.

Ould-Bachir, T., Dufour, C., Belanger, J., Mahseredjian, J., David, J.-P., May 2012. Effective floating-point calculation engines intended for the FPGA-based HIL simulation. In: Industrial Electronics (ISIE), 2012 IEEE International Symposium on. pp. 1363–1368.

Peng, S., Sedukhin, S., Sedukhin, I., Aug 1996. Parallel algorithm and architecture for two-step division-free Gaussian elimination. In: Application Specific Systems, Architectures and Processors, 1996. ASAP 96. Proceedings of International Conference on. pp. 183–192.

Prasanna, V., Morris, G., March 2007. Sparse matrix computations on reconfigurable hardware. Computer 40 (3), 58–64.

Shen, K., Crossley, J., Lun, A., Liu, H., 1999. The Nine Chapters on the Mathematical Art: Companion and Commentary. Oxford University Press. URL https://books.google.ca/books?id=eiTJHRGTG6YC

Zhang, W., Betz, V., Rose, J., Mar. 2012. Portable and scalable FPGA-based acceleration of a direct linear system solver. ACM Trans. Reconfigurable Technol. Syst. 5 (1), 6:1–6:26. URL http://doi.acm.org/10.1145/2133352.2133358

Zhuo, L., Prasanna, V. K., Aug 2008. High-performance designs for linear algebra operations on reconfigurable hardware. IEEE Transactions on Computers 57 (8), 1057–1071.

Ziad, M. T. I., Alkabani, Y., El-Kharashi, M. W., Aug 2015. On hardware solution of dense linear systems via Gauss-Jordan elimination. In: Communications, Computers and Signal Processing (PACRIM), 2015 IEEE Pacific Rim Conference on. pp. 364–369.

Algorithm 1: Gauss-Jordan algorithm

Data: Matrices A (n × n) and b (n × 1)
Result: C = A|b is transformed so that A is the identity and still Ax = b
 1  for k = 1 to n do
 2      if (pivoting) then
 3          Find Ck−1(m, k) | ∀n ≠ m, |Ck−1(n, k)| < |Ck−1(m, k)|
 4          Swap rows k and m
 5      end
 6      M = Ck−1(k, k)
 7      for i = 1 to n do
 8          L = Ck−1(i, k)
 9          if (k == i) then
10              // Make Ck(i, k) one
11              Ck(k, ...) = (1/M) Ck−1(k, ...)
12          else
13              // Make Ck(i, k) zero
14              Ck(i, ...) = Ck−1(i, ...) − (L/M) Ck−1(k, ...)
15          end
16      end
17  end

Algorithm 2: The division-free algorithm (Martinez-Alonso et al., 2010)

Data: Matrices A (n × n) and b (n × 1)
Result: C = A|b is transformed so that A is diagonal and still Ax = b
 1  for k = 1 to n do
 2      if (pivoting) then
 3          Find Ck−1(m, k) | ∀n ≠ m, |Ck−1(n, k)| < |Ck−1(m, k)|
 4          Swap rows k and m
 5      end
 6      M = Ck−1(k, k)
 7      for i = 1 to n do
 8          L = Ck−1(i, k)
 9          if (k == i) then
10              // Row k is not modified
11              Ck(k, ...) = Ck−1(k, ...)
12          else
13              // Make Ck(i, k) zero
14              Ck(i, ...) = M Ck−1(i, ...) − L Ck−1(k, ...)
15          end
16      end
17  end
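The magnitude problem discussed in Section 3.2 is easy to reproduce in software. The following sketch (an illustrative random system, not taken from the paper) runs Algorithm 2 without pivoting and prints the largest magnitude in C after each step; the magnitude roughly squares at each step and quickly overflows even in double precision:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 12
    C = np.hstack([rng.uniform(1.0, 9.0, (n, n)), rng.uniform(1.0, 9.0, (n, 1))])
    for k in range(n):
        M = C[k, k]
        for i in range(n):
            if i != k:
                C[i] = M * C[i] - C[i, k] * C[k]   # line 14 of Algorithm 2
        print(f"step {k + 1}: max |C| = {np.abs(C).max():.3e}")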

Algorithm 3: Proposed algorithm

Data: Matrices A (n × n) and b (n × 1)
Result: C = A|b is transformed so that A is diagonal and still Ax = b
 1  Initialize C0 = A|b
 2  for k = 1 to n do
 3      if (pivoting) then
 4          Find Ck−1(m, k) | ∀n ≠ m, |Ck−1(n, k)| < |Ck−1(m, k)|
 5          Swap rows k and m
 6      end
 7      M = Ck−1(k, k)
 8      for i = 1 to n do
 9          L = Ck−1(i, k)
10          if (k == i) then
11              // Row k is not modified
12              Ck(k, ...) = Ck−1(k, ...)
13          else
14              // Make Ck(i, k) zero
15              Ck(i, ...) = M Ck−1(i, ...) − L Ck−1(k, ...)
16              Ck(i, ...) = (1/2^e) Ck(i, ...), where 2^e is the power-of-two estimate of M
17          end
18      end
19  end
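Line 16 is the only change with respect to Algorithm 2. Because the scaling factor is a power of two, it only touches the exponent field and introduces no rounding error. The following sketch illustrates this property; math.frexp and math.ldexp are used here merely to model the exponent-field manipulation that the hardware performs with a one-stage exponent subtraction:

    import math

    def pow2_estimate(m):
        """Power-of-two estimate of |m|: |m| = f * 2**e with f in [0.5, 1)."""
        e = math.frexp(m)[1]
        return math.ldexp(1.0, e)          # exactly 2**e

    M = 1234.567
    s = pow2_estimate(M)                   # 2048.0
    x = 3.141592653589793
    assert (x / s) * s == x                # exact round trip: the scaling is error free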

Algorithm 4: Proposed simplified algorithm

Data: Matrices A (n × n) and b (n × 1)
Result: C = A|b|d where d(i) x(i) = b(i)
 1  Initialize C0 = A|b|0
 2  for k = 1 to n do
 3      if (pivoting) then
 4          Find Ck−1(m, k) | ∀n ≠ m, |Ck−1(n, k)| < |Ck−1(m, k)|
 5          Swap rows k and m
 6      end
 7      M = Ck−1(k, k)
 8      for i = 1 to n do
 9          L = Ck−1(i, k)
10          if (k == i) then
11              // Row k is not modified
12              Ck(k, k+1 ... n+1) = Ck−1(k, k+1 ... n+1)
13              Ck(k, n+2) = Ck−1(k, k)
14          else
15              // Make Ck(i, k) zero
16              Ck(i, k+1 ... n+2) = M Ck−1(i, k+1 ... n+2) − L Ck−1(k, k+1 ... n+2)
17              Ck(i, k+1 ... n+2) = (1/2^e) Ck(i, k+1 ... n+2), where 2^e is the power-of-two estimate of M
18          end
19      end
20  end
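A software model of Algorithm 4, including pivoting and the final per-row division, can be checked against a standard solver. This is a validation sketch only (the hardware streams columns through the PMSS operators instead); the slave update below corresponds to lines 16 and 17:

    import math
    import numpy as np

    def solve_division_free(A, b):
        """Software model of Algorithm 4, followed by the final division x_i = b'_i / d_i."""
        n = len(b)
        C = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float), np.zeros((n, 1))])
        for k in range(n):
            m = k + np.argmax(np.abs(C[k:, k]))       # pivot among rows never chosen as master
            C[[k, m]] = C[[m, k]]
            M = C[k, k]
            s = math.ldexp(1.0, math.frexp(M)[1])     # power-of-two estimate of M
            for i in range(n):
                if i != k:                            # lines 16-17: multiply, subtract, scale
                    C[i, k + 1:] = (M * C[i, k + 1:] - C[i, k] * C[k, k + 1:]) / s
            C[k, n + 1] = M                           # line 13: park the diagonal in column d
        return C[:, n] / C[:, n + 1]                  # d(i) x(i) = b'(i)

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((20, 20)), rng.standard_normal(20)
    assert np.allclose(solve_division_free(A, b), np.linalg.solve(A, b))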

Algorithm 5: Evaluation of the total latency

Data: n: the size of the system
Data: Lpo: the latency of the PMSS operator
Data: Lpivot: the latency to find the pivot row
Data: Lrouting: the latency to route the data from the master PMSS to the slave PMSS operators
Data: P: the number of PMSS operators per RPE
Result: nb_cycles: the total number of clock cycles to make the system diagonal
 1  LPR = Lpivot + Lrouting
 2  // First step
 3  nb_cycles = LPR − 1 + max(Lpo + LPR, ⌈n/P⌉)
 4  // Internal steps
 5  for k = 2 to n−1 do
 6      nb_cycles += max(Lpo + LPR, ⌈(n + 2 − k)/P⌉)
 7  end
 8  // Last step
 9  if (P == 1) then
10      nb_cycles += Lpo + 1
11  else
12      nb_cycles += Lpo
13  end
14  return nb_cycles
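For convenience, a direct Python transcription of Algorithm 5 reproduces the table entries that follow. For instance, it returns 56 and 307 for Table 1 (n = 4 and n = 20, Lpo = 13, with Lpivot = 1 and Lrouting = 0 as explained in Section 6.1) and 64 and 5359 for Table 4:

    import math

    def total_cycles(n, lpo, lpivot, lrouting, p):
        """Direct transcription of Algorithm 5."""
        lpr = lpivot + lrouting
        cycles = lpr - 1 + max(lpo + lpr, math.ceil(n / p))       # first step
        for k in range(2, n):                                     # internal steps
            cycles += max(lpo + lpr, math.ceil((n + 2 - k) / p))
        cycles += lpo + 1 if p == 1 else lpo                      # last step
        return cycles

    print(total_cycles(4, 13, 1, 0, 1), total_cycles(20, 13, 1, 0, 1))    # 56 307 (Table 1)
    print(total_cycles(4, 13, 2, 1, 1), total_cycles(100, 13, 7, 2, 1))   # 64 5359 (Table 4)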

Table 1: Total latency for P=1 (clock cycles)

  n   Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
  4       56      68      80      92     104     116     128
  8      112     136     160     184     208     232     256
 12      168     204     240     276     312     348     384
 16      229     272     320     368     416     464     512
 20      307     349     400     460     520     580     640
 40      937     979    1030    1090    1159    1237    1324
 60     1967    2009    2060    2120    2189    2267    2354
 80     3397    3439    3490    3550    3619    3697    3784
100     5227    5269    5320    5380    5449    5527    5614

Table 2: Total latency for P=2 (clock cycles)

  n   Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
  4       55      67      79      91     103     115     127
  8      111     135     159     183     207     231     255
 12      167     203     239     275     311     347     383
 16      223     271     319     367     415     463     511
 20      279     339     399     459     519     579     639
 40      607     694     799     919    1039    1159    1279
 60     1127    1214    1319    1442    1583    1742    1919
 80     1847    1934    2039    2162    2303    2462    2639
100     2767    2854    2959    3082    3223    3382    3559

Table 3: Total latency for P=4 (clock cycles)

  n   Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
  4       55      67      79      91     103     115     127
  8      111     135     159     183     207     231     255
 12      167     203     239     275     311     347     383
 16      223     271     319     367     415     463     511
 20      279     339     399     459     519     579     639
 40      559     679     799     919    1039    1159    1279
 60      844    1019    1199    1379    1559    1739    1919
 80     1209    1386    1599    1839    2079    2319    2559
100     1674    1851    2064    2313    2599    2899    3199

Table 4: Total latency for P=1 (clock cycles), with pivoting

  n   Lpivot  Lrouting  Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
  4      2       1          64      76      88     100     112     124     136
  8      3       1         136     160     184     208     232     256     280
 12      4       2         228     264     300     336     372     408     444
 16      4       2         304     352     400     448     496     544     592
 20      5       2         400     460     520     580     640     700     760
 40      6       2        1049    1112    1184    1265    1355    1454    1562
 60      6       2        2079    2142    2214    2295    2385    2484    2592
 80      7       2        3529    3595    3670    3754    3847    3949    4060
100      7       2        5359    5425    5500    5584    5677    5779    5890

Table 5: Total latency for P=2 (clock cycles), with pivoting

  n   Lpivot  Lrouting  Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
  4      2       1          63      75      87      99     111     123     135
  8      3       1         135     159     183     207     231     255     279
 12      4       2         227     263     299     335     371     407     443
 16      4       2         303     351     399     447     495     543     591
 20      5       2         399     459     519     579     639     699     759
 40      6       2         839     959    1079    1199    1319    1439    1559
 60      6       2        1358    1487    1634    1799    1979    2159    2339
 80      7       2        2119    2254    2407    2578    2767    2974    3199
100      7       2        3039    3174    3327    3498    3687    3894    4119

Table 6: Total latency for P=4 (clock cycles), with pivoting

  n   Lpivot  Lrouting  Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
  4      2       1          63      75      87      99     111     123     135
  8      3       1         135     159     183     207     231     255     279
 12      4       2         227     263     299     335     371     407     443
 16      4       2         303     351     399     447     495     543     591
 20      5       2         399     459     519     579     639     699     759
 40      6       2         839     959    1079    1199    1319    1439    1559
 60      6       2        1259    1439    1619    1799    1979    2159    2339
 80      7       2        1759    1999    2239    2479    2719    2959    3199
100      7       2        2226    2499    2799    3099    3399    3699    3999

Table 7: Multiplier synthesis results (Freq in MHz)

               Single precision              Double precision
  Stages   Freq   ALM   Reg.   DSP      Freq   ALM   Reg.   DSP
   5        444    94    190     1       292   181    296     4
   6        446   103    252     1       339   205    474     4
  10        451   145    457     1       367   293    862     4
  11        460   155    493     1       364   310    936     4

Table 8: Subtractor synthesis results (Freq in MHz)

               Single precision              Double precision
  Stages   Freq   ALM   Reg.   DSP      Freq   ALM   Reg.   DSP
   7        400   364    401     0       370   739    740     0
  11        404   354    744     0       364   713   1372     0
  12        550   363    788     0       466   726   1452     0

Table 9: PMSS operator synthesis results (Freq in MHz)

Single precision
  Config   Freq    ALM   Reg.   DSP   Latency
  5-7       256    520    943     2        13
  5-12      439    565   1254     2        18

Double precision
  Config   Freq    ALM   Reg.   DSP   Latency
  5-7       205   1105   1569     8        13
  6-7       210   1122   1813     8        14
  10-7      207   1255   2592     8        18
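As a side note, the latency column of Table 9 follows directly from the configuration format defined in Section 6.2; a sketch of the arithmetic, assuming the single power-of-two scaling stage of Section 4.1:

    for mult, sub in [(5, 7), (5, 12), (6, 7), (10, 7)]:
        # multiplier stages + subtractor stages + one scaling stage
        print(f"{mult}-{sub}: {mult + sub + 1} stages")   # 13, 18, 14, 18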

Table 10: Total latency (µs) without pivoting

                     P=1                             P=2                             P=4
        single   single   double       single   single   double       single   single   double
Size    5-7(13)  5-12(18) 5-7(13)      5-7(13)  5-12(18) 5-7(13)      5-7(13)  5-12(18) 5-7(13)
   4     0.219    0.173    0.273        0.215    0.171    0.268        0.215    0.171    0.268
   8     0.437    0.347    0.546        0.433    0.344    0.541        0.433    0.344    0.541
  12     0.656    0.520    0.819        0.652    0.518    0.814        0.652    0.518    0.814
  16     0.894    0.693    1.117        0.870    0.691    1.088        0.870    0.691    1.088
  20     1.198    0.871    1.497        1.089    0.864    1.361        1.089    0.864    1.361
  40     3.657    2.307    4.570        2.369    1.737    2.960        2.182    1.731    2.726
  60     7.677    4.656    9.593        4.399    2.923    5.496        3.294    2.597    4.116
  80    13.259    7.916   16.567        7.209    4.565    9.008        4.719    3.475    5.896
 100    20.401   12.088   25.493       10.800    6.662   13.495        6.534    4.535    8.164

Table 11: Total latency (µs) with pivoting

                     P=1                             P=2                             P=4
        single   single   double       single   single   double       single   single   double
Size    5-7(13)  5-12(18) 5-7(13)      5-7(13)  5-12(18) 5-7(13)      5-7(13)  5-12(18) 5-7(13)
   4     0.250    0.192    0.312        0.246    0.189    0.307        0.246    0.189    0.307
   8     0.531    0.401    0.663        0.527    0.399    0.658        0.527    0.399    0.658
  12     0.890    0.657    1.112        0.886    0.654    1.107        0.886    0.654    1.107
  16     1.187    0.876    1.483        1.183    0.873    1.478        1.183    0.873    1.478
  20     1.561    1.140    1.951        1.557    1.138    1.946        1.557    1.138    1.946
  40     4.094    2.642    5.116        3.275    2.369    4.092        3.275    2.369    4.092
  60     8.114    4.991   10.139        5.300    3.609    6.623        4.914    3.554    6.140
  80    13.774    8.308   17.211        8.271    5.367   10.335        6.865    4.922    8.579
 100    20.916   12.481   26.136       11.861    7.465   14.821        8.688    6.154   10.856

Table 12: Synthesis results of the complete system

   n   Lpivot  Lrouting   Freq     ALM    Registers  Memory bits  DSP blocks
   4      5       2      204.50    3026       5417        1024          8
   8      6       2      196.77    6122     101778        4096         16
  12      6       2      219.83    9092      16107        6936         24
  16      7       2      219.39   12303      21570       17440         32
  20      7       2      219.25   14986      26920       21800         40
  40      8       2      214.55   30039      53957       84560         80
  60      8       2      195.01   45379      80282      126840        120
  80      9       2      182.98   61110     106914      332960        160
 100      9       2      177.81   76363     133597      416200        200

Author Biography

Jean Pierre David received the degree in electrical engineering from Université de Liège, Belgium, in 1995, and the Ph.D. degree from Université Catholique de Louvain, Louvain-la-Neuve, Belgium, in 2002. He was an Assistant Professor with Université de Montréal, QC, Canada, for three years before joining Polytechnique Montréal, QC, Canada, as an Assistant Professor in 2006. He has been an Associate Professor there since 2013. His research interests include digital system design, reconfigurable computing, high-level synthesis and their applications.