Highlights

• Low latency solvers are necessary for real time applications (simulation/control).
• The divider circuits used in most previous works induce long latencies.
• We propose a division free parallel architecture adapted to floating point arithmetic.
• We obtain two orders of magnitude gains compared to previous works.
• 100-equation systems can be solved under 10 microseconds.
Low Latency and Division Free Gauss-Jordan Solver in Floating Point Arithmetic

Jean Pierre David
Polytechnique Montréal

Corresponding author: Jean Pierre David, C.P. 6079, succursale Centre-ville, Montréal, QC, Canada, H3C 3A7, Tel +1 514 340 4711 p2009, Email: [email protected]
Abstract

In many applications, the solution of a linear system is computed with Gaussian elimination followed by back-substitution, or with Gauss-Jordan elimination. The latter is intrinsically more parallel, enabling smaller computing latencies at the price of more complex hardware. However, both methods require the division operator, which places a time-consuming resource in the critical path of the algorithms and impacts the latency of the global processing. Jordan was already aware of a division free algorithm. However, its implementation involves multiplications at each step and the size of the numbers rapidly becomes too large for an efficient implementation of large systems. In this work, we present a small modification to the division free algorithm in order to keep the size of the numbers in a reasonable range for standard floating point numbers. This is possible thanks to the special format of floating point numbers, which enables error free and hardware efficient divisions by powers of two. We also propose a parallel and pipelined architecture that best exploits the proposed algorithm, including partial pivoting. We focus in particular on the global latency of the system as a function of its size, the latency of the floating point operators, and the number of operators that are available. Results demonstrate that current FPGAs can solve linear systems larger than one hundred equations within ten microseconds. This represents a two-order-of-magnitude improvement over previous implementations for relatively small systems.

Keywords: Gaussian Elimination, Gauss-Jordan, Linear Solver, Low Latency Solver, Real Time Simulation

1. Introduction

Solving a system of linear equations Ax = b, where A is a matrix of size N × N and b is a vector of size N, is a problem that was already addressed in ancient China in the first century BC. A method similar to Gaussian elimination was known at that time and described in a manuscript named "The Nine Chapters" (Shen et al., 1999). The method is still in use today in many applications, under various implementations and with some optimizations, including recent works such as (Chowdhury et al., 2013) and (Faverge et al., 2015). Most approaches are founded on Gaussian elimination, LU factorization, and the Gauss-Jordan algorithm. Gaussian elimination minimizes the total number of operations for a given system Ax = b. LU factorization can be seen as an evolution of Gaussian elimination because the computation of the matrices L (lower triangular) and U (upper triangular) such that A = LU is similar to Gaussian elimination. However, once L and U are computed, solving the system LUx = b can be done by just applying a forward substitution followed by a backward substitution. The LU method thus minimizes the
number of computations when the system Ax = b must be solved for different vectors b and a constant matrix A. To our knowledge, the Gauss-Jordan algorithm (and its further optimizations) is the most parallel approach to solving a system Ax = b. It requires more operations than the previous methods, but it offers the best latency. Latency is an important issue for real time simulation systems (Chen and Dinavahi, 2011), sometimes used in a Hardware In the Loop (HIL) environment, e.g. (Arias-Garcia et al., 2013) (a system of six equations) and (Ould-Bachir et al., 2012) (a system of fourteen equations). Such applications are generally based on a system of equations to solve at each time step, typically on the order of ten microseconds. In some applications, the time step can even be as low as a single microsecond, or less, which is hardly achievable with CPU-based solutions (Ould-Bachir et al., 2015). When the linear system is known in advance, it is possible to pre-compute the inverse matrix (Dufour et al., 2012), leading to straightforward hardware architectures involving a simple matrix product. In some cases, iterative methods can also be applied (Prasanna and Morris, 2007). Such methods use the latest known solution to converge towards a new solution taking the recent small changes into account. Nevertheless, in the general case, these methods cannot be applied to real time low latency applications. FPGAs are well known to enable the implementation of massively parallel architectures in fixed-point arithmetic, making them excellent candidates for DSP algorithms. Since the beginning of the century, FPGAs have also been able to compete with standard processors for best performance in single precision applications (Lienhart et al., 2002), and later in double precision applications (Govindu et al., 2005; Zhuo and Prasanna, 2008). Over the last decade, the number of dedicated arithmetic blocks available in FPGAs has grown impressively. FPGA manufacturers ship circuits with thousands of fixed-point embedded multipliers and millions of logic blocks that can be assembled together to build hundreds of double precision floating point operators. Today, FPGAs are really interesting targets for parallel applications requiring floating point arithmetic. Most FPGA-based floating point operators have a highly pipelined architecture because this is the best way to maximize their efficiency. Nevertheless, this is an important obstacle in the presence of operations with strong dependencies, since the next operation can only start when the previous one has passed through all the pipeline stages. Moreover, the division operator, which is required by the previously cited methods, cannot benefit straightforwardly from the dedicated arithmetic blocks, which results in very long pipelines. Such an operator can require up to 61 stages, compared to 14 stages for an adder and 10 stages for a multiplier in double precision arithmetic (Altera, 2014). In the general case, the Gauss-Jordan algorithm requires N steps and a pivot row must be selected at each step. More details are given in Section 3. Pivoting makes it possible to lower the residues of the solution. In some cases, Gaussian elimination simply does not work when pivoting is not supported because it would lead to a divide-by-zero error. Nevertheless, in some cases (e.g. diagonally dominant systems), it can be proven that pivoting is not mandatory. This paper extends a preliminary version of the work published in (David, 2015). At that time, pivoting was not implemented and only
partial compilation results were presented. This work fully describes the proposed method to eliminate the need for divisions during the Gauss-Jordan process and proposes the following new contributions:

1. A complete architecture based on the division free algorithm, with pivoting, that exploits both pipelining and parallelism.
2. An algorithm to evaluate the total latency of the proposed solver when pivoting is implemented.
3. Synthesis results of complete systems (4 to 100 equations) with pivoting.

Without pivoting, results show that an 18-equation system can be solved within a microsecond in single precision, while 100-equation systems can be solved within ten microseconds in single or double precision. Pivoting increases the latency of the critical path. Nevertheless, it is still possible to solve 100-equation systems within ten microseconds in single precision. The paper is organized as follows. Section 2 reports and discusses previous works related to linear system solvers on FPGAs in floating point arithmetic. Processor arrays and division free implementations are also reported. Section 3 describes the basics of the Gauss-Jordan algorithm and details a previous work (Martinez-Alonso et al., 2010) close to the proposed method. Sections 4 and 5 respectively present the proposed algorithm and architecture. Sections 6 and 7 evaluate and discuss the performance of the proposed method in terms of FPGA resources and global latency. Section 8 concludes this work.
2. Previous work

2.1. FPGA implementations

Linear solvers, and more generally linear algebra, are the foundation of many applications. As soon as floating point arithmetic became efficient on FPGAs, several works were reported in these fields. In 2004, Johnson et al. proposed an FPGA implementation of the LU algorithm (Johnson et al., 2004) while Böhm and Hammes proposed an FPGA implementation of the Gauss-Seidel algorithm (Böhm and Hammes, 2004). Both works are based on single precision operators. The same year, Daga et al. proposed a block implementation of the LU algorithm in single or double precision arithmetic (Daga et al., 2004). Up to ten processing elements can run in parallel in a single FPGA. Each one is based on deeply pipelined floating point units, including a 32-stage divider that is in the critical path of the algorithm. In 2006, de Matos and Neto reported a Gauss-Jordan implementation with a speedup of 3 compared to a high-end processor for matrices of orders up to 1700 (de Matos and Neto, 2006). It is interesting to note that they use pipelined single precision floating point operators with 18 stages for multiplications/subtractions and 54 stages for division. The next year, the same authors proposed a memory optimized architecture with a speedup of 2 compared to the previous one (de Matos and Neto, 2007). In 2009, an approach to invert a matrix with the Gauss-Jordan algorithm was proposed in (Duarte et al., 2009). The architecture is based on double precision Row Processing Units (RPUs) that include a custom designed 64-stage pipelined data path; 29 cycles out of the 64 are required by the division.
Results show that a Virtex XC5VSX240T FPGA can contain up to 87 RPUs and process 510 × 510 matrices. In 2012, Zhang et al. proposed a portable and scalable architecture to implement the LU factorization on FPGA (Zhang et al., 2012). They report speedups of 2.2 and 1.7 for the processing of 10000 × 10000 matrices in single precision and double precision arithmetic respectively. In order to minimize the use of dividers, the authors use a single reciprocal unit followed by many multipliers running in parallel. The same year, an architecture was proposed to invert a matrix with the Gauss-Jordan algorithm in single, 40-bit and double precision (Arias Garcia et al., 2012b). The authors also measured the error propagation due to floating point arithmetic. In (Arias Garcia et al., 2012a), the same authors proposed an implementation of Gaussian elimination in single precision arithmetic. In 2013, Moussa et al. proposed an implementation of the Gauss-Jordan algorithm in single precision arithmetic for complex numbers (Moussa et al., 2013). The emphasis is put on the architecture of operators that handle complex numbers. The same year, Chowdhury et al. presented multicore-oblivious algorithms, including the Gaussian elimination algorithm (Chowdhury et al., 2013). In 2015, Ziad et al. proposed a pipelined architecture in single precision arithmetic for solving dense linear systems with reasonable power and area overheads (Ziad et al., 2015). The same year, Faverge et al. proposed a way to combine the stability of QR and the efficiency of LU factorizations to design high-performance dense linear algebra solvers (Faverge et al., 2015).
2.2. Processor arrays and division free implementations

The concept of a division free version of the Gaussian elimination algorithm was already known in 1968, when Bareiss proposed a two-step optimization of this method (Bareiss, 1968). Actually, Bareiss mentions that the one-step division free method was already known by Jordan himself. In 1993, Kirsch et al. proposed a scaling technique which reduces the number of divisions. Nevertheless, the scalings affect the precision (Kirsch and Turner, 1993, 1994). In 1996, Peng et al. proposed an iterative version of the two-step method and a 2D array processor architecture suitable for VLSI implementation (Peng et al., 1996), both inspired by Bareiss' work. The same year, Bermond et al. proposed a parallel implementation of standard Gaussian elimination on a systolic array (Bermond et al., 1996). In 1997, Bhuvaneswari et al. proposed a systolic array implementation of successive Gaussian elimination with 2N processing elements. In 2010, Martinez-Alonso et al. proposed an FPGA implementation of the one-step division free method with an array of processors supporting single or double precision arithmetic (Martinez-Alonso et al., 2010). Nevertheless, the authors do not mention the known problem of the increasing magnitude of the data. Since data are multiplied at each step, their magnitude grows exponentially, leading to numbers that cannot be handled efficiently. Our proposed architecture can be seen as an evolution of these architectures. However, we fix the magnitude problem, we take into account the pipelined architecture of floating point operators to optimize the performance, and we support pivoting. More details are given in the next sections.
3. Foundation

In this section, we detail the Gauss-Jordan algorithm, which is the foundation of the current work. We also detail the division-free algorithm, mentioned by Bareiss as already known by Jordan (Bareiss, 1968), and a recent architecture inspired by that work: the division-free parallel architecture (Martinez-Alonso et al., 2010). Their aim is to solve the system Ax = b, where A is an n × n matrix and x, b are n-element vectors. Both the Gauss-Jordan algorithm and the division-free parallel architecture are n-step algorithms aiming at simplifying the matrices A and b such that, at the final step, A is a diagonal matrix (the identity in the case of the Gauss-Jordan algorithm), leading to a trivial solution for Ax = b. In the following, matrix C is the concatenation of A and b, C = A|b. The content of matrix C at row i and column j after step k is denoted Ck(i, j). The ith row of C after step k is denoted Ck(i, ...). C0 is the initial content of C.

3.1. Gauss-Jordan algorithm

The Gauss-Jordan algorithm is founded on the following property: replacing a row of C by a linear combination of itself (with a non-zero coefficient) and other rows of C does not affect the solution of the system. Each step k is dedicated to simplifying column k so that, at the end of the step, all the elements of column k are zero, except the diagonal element C(k, k), which has become one. When pivoting is supported, each step starts with a row swap that maximizes the absolute value of C(k, k). Pivoting improves the residual of the solution and also fixes the problem of a null value for C(k, k). The whole process is described in Algorithm 1.
It must be noted that, since after step k the columns 1 to k are filled with zeros except the diagonal elements, which are ones, the number of computations actually processed in the innermost loop of Algorithm 1 can be reduced. Moreover, all the computations in step k can be processed in parallel since they only depend on the values computed in step k − 1.

3.2. The division-free algorithm

The division-free algorithm (DFA) eliminates the need for divisions until the very last step, where A is diagonal. This is particularly interesting for FPGA implementation because divisions are expensive in terms of logic gates and latency. Actually, FPGAs have dedicated logic to compute multiplications, additions and subtractions, but not divisions. Eliminating divisions enables much more efficient FPGA implementations. The DFA exploits the fact that a row in C can be multiplied by a non-zero value without affecting the solution of Ax = b. We propose a slightly different illustration of the algorithm presented in (Bareiss, 1968) and used in (Martinez-Alonso et al., 2010), to better illustrate the relation with the Gauss-Jordan algorithm. Actually, the DFA described in Algorithm 2 is exactly the same as Algorithm 1 where each row is multiplied by M. As with the Gauss-Jordan algorithm, all the computations can be processed in parallel. In (Martinez-Alonso et al., 2010), an array of processors is used to compute each step. One processor is associated with each element of C and the processors can forward the data to their neighbors in order to initialize the array, propagate the multiplicative constants and get the results back. However, multiplying each row by M at each step has an important side effect, since the magnitude of M increases at each step too. Actually, the order of magnitude of each element of C is squared at each step. In fixed-point arithmetic, this means that the bit-length of each element is doubled at each step. The method is thus not applicable to fixed-point arithmetic. In floating-point arithmetic, it is the magnitude of the exponents that is doubled at each step, which limits the application of the method to small systems. The authors do not mention this problem. There is another constraint related to logic design, especially in FPGA implementations: complex operators must be pipelined in order to efficiently exploit their computational power. Floating point arithmetic easily leads to data paths with tens of pipeline stages. If a pipelined circuit is used to compute a single evaluation over the whole step, the circuit's efficiency is only a small fraction of its potential maximum efficiency. Few details are given about the architecture in (Martinez-Alonso et al., 2010), but since a dedicated processor is used for each element of C, it seems that the pipeline is not exploited.

4. Proposed algorithm and pipeline opportunity

4.1. Proposed algorithm

The proposed algorithm is illustrated in Algorithm 3. The only difference with the DFA is the division by the power-of-two estimate of M. In floating point arithmetic, this estimate is directly given by the exponent field of the number. Dividing by a power of two can be implemented with a simple subtraction applied to the exponent field. This is one of the main contributions of the present paper: normalizing the exponents only requires a single pipeline stage and very few logic resources in an FPGA, and it prevents the rows from increasing as in the DFA. Taking advantage of the pipeline requires an analysis of the data dependencies between two steps, since we already know that there are no data dependencies inside a step. It appears that all the computations involve M and L, which are values from row k and row i, computed at the previous step. Moreover, at the end of step k, all the elements of column k are null except Ck(k, k). So, at step k, we should only compute Ck(i, j) where j > k and update the diagonal elements Ck(i, i) for i < k. A practical approach to implement such an optimization is illustrated in Algorithm 4, where the diagonal elements are actually stored in an extra column at the end of matrix C.

4.2. Pipeline opportunity

A simple example will best illustrate the opportunity offered by pipelined operators. In our example, consistent with the floating point operators proposed by Altera, a multiplication requires 5 pipeline stages, an addition requires 7 pipeline stages and the division by a power of 2 requires a single pipeline stage. The heart of Algorithm 4 is the computation processed in lines 16 and 17, which we will name PMSS (Parallel Multiplication, Subtraction, Scaling). It is illustrated in Figure 1. The total latency of the PMSS operator is 13 cycles since the two multiplications are computed in parallel. Let us consider a matrix C with n = 20. If we use a different PMSS operator for each element of C, as in (Martinez-Alonso et al., 2010), we need 20 × 20 = 400 PMSS operators. For now, we take neither the communication delay nor the pivoting delay into account, so each step only requires 13 clock cycles and the total computation requires 20 steps × 13 = 260 clock cycles. However, only one stage out of thirteen of each PMSS
[Figure 1 shows the pipelined PMSS datapath: two parallel multipliers (by M and by L), a subtractor, and a final scaling stage (multiplication by the power-of-two estimate of 1/|M|).]
Figure 1: The Parallel Multiplication, Subtraction, Scaling (PMSS) operator
operator is actually processing useful data at each clock cycle, resulting in only 7.7% efficiency. Let us now consider the case where a single PMSS operator is used for each row. Algorithm 4 first computes the M and L values that will be needed at the next step. At step 1, all C0(i, n + 2) are zeros, so we only need 20 clock cycles to feed the PMSS operators and compute the values C1(i, 2 . . . 21). The M and L values required for step 2 will already be available after 13 clock cycles, so step 2 can start after 20 clock cycles. Step 2 will also require 20 clock cycles to compute the values C2(i, 3 . . . 22). Starting at step 3, the number of operations to compute per step is decremented at each step. Step 3 will require 19 clock cycles, step 4 will require 18 clock cycles, and so on; step 7 will require 14 clock cycles and all the following steps will require 14 cycles too, since this is the minimum time to compute the next M and L values and to get the next element from the master and slave PMSS operators. Finally, step 20 will feed the PMSS operators with only 2 values and the last result will be obtained after 14 clock cycles. As a result, the whole algorithm requires 307 clock cycles (an increase of 18% in time) with only 20 PMSS operators (only 5% of the 400 PMSS operators in the previous configuration).
An intermediate configuration would use 2 PMSS operators for each row, such that each step would only require 14 clock cycles (13 for the last one). In this configuration, we now have 279 clock cycles (an increase of 7% in time) but only 40 PMSS operators, a reduction of 90% of the logic. This example has demonstrated the potential of pipelined resources to implement the proposed algorithm. The next section details the proposed architecture that can fully exploit such resources.

5. Proposed architecture

We propose an architecture based on the PMSS operator embedded inside a Row Processing Element (RPE), which may also contain a local memory. The global architecture is illustrated in Figure 2. Each row is processed by a different RPE. At step k, all the RPEs send the value of their column k to the pivot selector. The largest value is selected as the next M and the associated row becomes the pivot row. A possible implementation is a binary tree of comparators. However, rows are not actually swapped. The RPE associated with the pivot row simply becomes the master and all the other RPEs become slaves, each with its associated L. Only the RPEs that have never been master are eligible to become the master. The role of the master RPE is to send its data, column by column, to all the slaves in parallel through two global buses. It does not perform any computation since row k is not modified at step k. M has already been sent to determine the master. Only the elements k + 1 . . . n + 1 are sent, in order of increasing column index. The role of the slave RPE associated with row i is to compute the PMSS
[Figure 2 shows n Row Processing Elements (RPE 1 to RPE n), each containing a PMSS operator and a local memory or direct routing, connected to the pivot selector and to the global buses that broadcast M and C(k, j) from the master RPE.]
Figure 2: Global view of the proposed architecture
operation, column by column, as soon as the data arrive from the master RPE. L has already been sent to determine the master. Only the elements k + 1 . . . n + 1 are sent, in order of increasing column index, simultaneously with the master elements. Depending on the number of operations to compute and the latency of the PMSS operator, the results can be stored locally or sent straight to the next step.

6. Results

This section analyzes the number of clock cycles required by the proposed architecture to solve a system Ax = b, depending on the size of the system and the latency of the PMSS operator. The total time required to compute the solution is simply the number of clock cycles divided by the frequency of the system, which mainly depends on the architecture of the PMSS operator.
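As a rough illustration of this conversion, with the single precision 5-7 PMSS operator of Table 9 running at about 256 MHz, the 307 cycles of the n = 20 example of Section 4.2 (no pivoting) correspond to approximately 307 / (256 × 10^6 Hz) ≈ 1.2 us, which is consistent with the corresponding entry of Table 10.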
It is not the purpose of this work to propose the best implementation of the PMSS operator, since previous works have already addressed the question of floating point fused operators and the performance is highly dependent on the target FPGA. Nevertheless, we present in this section some synthesis results of different configurations of the PMSS operator in order to give an insight into what is achievable with standard components from the FPGA manufacturer's library.

6.1. Latency

Considering the architecture proposed in Figure 2, we also evaluate the opportunity of using P PMSS operators running in parallel inside each RPE, to best address the case of large systems. Let us call Lpo the total latency of the PMSS operator, Lpivot the latency to find the pivot row, and Lrouting the latency of the routing to send the data in parallel to all the slave RPEs (the access to the bus is actually implemented as a large multiplexer, which may require one or several additional clock cycles, depending on the size of the system). The total number of PMSS activations in an RPE at step k is n for the first step and n + 2 − k for the next steps, since the null values do not need to be recomputed. When P > 1, the total number of activations per PMSS instance must be divided by P (rounded up). The last step has only two PMSS activations, so the number of clock cycles required by the last step is Lpo + 1 when P = 1 and Lpo as soon as P > 1. These considerations are the foundation of Algorithm 5, which computes the total number of clock cycles from n, Lpivot, Lrouting, Lpo and P. In the first step, we take into account the computation of the first pivot row, the sending of all the data to the PMSS operator and the computation of the second pivot row. It must be noted that the calculation of the pivot row can start as soon as the first data is output from the PMSS operator. Moreover, Lpivot should never be null because it takes at least one more clock cycle to receive the second data from the PMSS operator and be ready to start a new step, except for step one. The results presented in (David, 2015) did not take the pivoting into account. In that case, Lpivot = 1 (as mentioned previously) and Lpo includes the latency of the routing. A condensed version of these results is presented in Table 1 (P=1), Table 2 (P=2) and Table 3 (P=4). When pivoting is used, its latency can be estimated as Lpivot = ⌈log2(n)⌉ since we propose to use a binary tree of comparators. The latency of the routing is more difficult to estimate because it depends on the target technology. It can be implemented by a bus or a multiplexer of size n. In this paper, we consider Lrouting = ⌈log10(n)⌉. Results are presented in Table 4 (P=1), Table 5 (P=2) and Table 6 (P=4).

6.2. FPGA resources

The results presented in this section have been computed with the Quartus II 64-bit software (Altera), version 15.0.0 build 145. We chose the FPGA Stratix V 5SGSMD8N1F45C2, which has the highest number of DSP blocks (1963) and 262400 ALMs. We use the default synthesis options (Balanced - Normal flow). Reported frequencies have been computed with the Slow 900mV 85C model (post place and route). The floating point multiplier altfp_mult, proposed by Altera, can be synthesized with 5, 6, 10 or 11 pipeline stages with custom sizes for the exponent
and mantissa fields. We have synthesized both the single precision format (8-bit exponent, 23-bit mantissa) and the double precision format (11-bit exponent, 52-bit mantissa). Synthesis results are reported in Table 7. The floating point subtractor altfp_add_sub, proposed by Altera, can be synthesized with 7 to 14 pipeline stages. Synthesis results for a few test cases are reported in Table 8. We define the configuration format of the PMSS operator as <multiplier stages>-<subtractor stages>; the corresponding PMSS latency is the sum of the two plus one scaling stage (e.g. 5-7 gives 5 + 7 + 1 = 13 cycles). The synthesis results obtained suggest trying the following configuration formats: 5-7 and 5-12 in single precision, and 5-7, 6-7 and 10-7 in double precision. This way, we minimize the latency for a given frequency. It must be understood that, when we combine two operators, only the lowest frequency is relevant. Synthesis results of the PMSS operator are reported in Table 9. These latest results suggest finally keeping the single precision formats 5-7 and 5-12 and the double precision format 5-7 for the PMSS operator. It appears that the results presented in (David, 2015) contain a flaw for the configuration 5-12 (18). Due to an error in the collection of results, the latency used for that configuration (Tables VII, VIII and IX in said document) was 28 instead of 18, which resulted in a global latency higher than the actual one for that configuration. Table 10 presents a condensed view of the total latency without pivoting as it should have appeared in (David, 2015). Table 11 shows the total latency of the proposed architecture with pivoting, in the conditions already mentioned. Both tables are founded on the frequencies obtained in Table 9.
6.3. A complete system implementation

In order to validate the proposed architecture on a complete system, we have implemented a generic solver in VHDL and synthesized the system for several sizes n ranging from 4 to 100. We use the same FPGA Stratix V 5SGSMD8N1F45C2 and the same synthesis options as the ones reported at the beginning of this section. Each RPE has a single 5-7 PMSS operator (single precision, P = 1, Lpo = 13). The pivot selector is a generic binary tree of comparators. However, the comparators only compare the exponent fields since they adequately represent the magnitude of the numbers. Our aim here is simply to validate that the proposed Algorithm 5 can accurately compute the total latency and that the proposed architecture can really compute the solution of a system Ax = b. In this context, the proposed implementation has not been fully optimized. The latency of the pivot finder could be lowered. The routing latency could better fit the size of the system, and the latency of the PMSS operators could be lowered too by using fused operators. The synthesis results of the complete system are reported in Table 12. The architecture has been validated at RTL for n = 10 and n = 20. We have used random matrices A and b, and the results are perfectly coherent with a high level model of the algorithm written in Java. The number of clock cycles computed by Algorithm 5 has also been confirmed by the simulation.

7. Discussion

The proposed algorithm and architecture have been designed to minimize the latency as much as possible, which makes them particularly useful for
small systems that must compute a solution in real time. For large systems such as the ones targeted in (Zhang et al., 2012), the latency of the floating point operators is completely absorbed by the length of the rows. This can be observed in Figure 8 of said paper; the performance falls to zero when the size of the system becomes too low (compared to thousands of equations). Previous works do not give many details on the total latency for small systems. In (Arias Garcia et al., 2012a), Figure 4 illustrates a latency varying from approximately 100 us to 1400 us when n varies from 10 to 100 (single precision arithmetic). Our results show a total latency varying between 0.543 us and 6.534 us in the same conditions, which is more than a two-order-of-magnitude improvement. It is always difficult and risky to compare the performance of different FPGA implementations since many parameters may differ. This is why we have applied a conservative methodology to evaluate our performance with standard floating point resources and standard synthesis options. We are confident that a fused PMSS operator would reduce the number of pipeline stages, and a fine tuning of the place and route process may also lead to better timings. Altera recently proposed the Arria 10 and Stratix 10 series, which embed hardened floating-point DSP blocks. In this context, the latency of the PMSS operator could be as low as four clock cycles for single precision arithmetic. Nevertheless, the contributions of this paper are the proposed algorithm and architecture, not the estimation of a particular implementation's performance. The results presented in this paper enable a better comparison to previous works. Compared to the results presented in (David, 2015), the implementation of
the pivoting increases the total latency from 2.5% (single precision, n=100, P=1, Lpo = 13) to 50% (single precision, n=40, P=4, Lpo = 13), depending on the configuration. The more the pivoting is in the critical path, the higher the latency overhead. Nevertheless, it is still possible to solve 100-equation systems in single precision within ten microseconds. It must also be noted that the final result is a diagonal matrix. A last division step may be needed if the final values are required. Such a resource can be shared by several or all of the RPEs. The ALTFP_DIV divider proposed by Altera requires 33 pipeline stages in single precision arithmetic and 61 stages in double precision arithmetic. Evaluating the benefit of the proposed solution on the precision is a difficult task. Indeed, floating point operations are not associative, as illustrated by the following operations in double precision format:
10^-50 + (-10^-50) + 1 + (-1) = ?
(10^-50 + (-10^-50)) + (1 + (-1)) = 0
10^-50 + (((-10^-50) + 1) + (-1)) = 10^-50
(-10^-50) + ((10^-50 + 1) + (-1)) = -10^-50

In floating point, a given system of equations can thus lead to different solutions by simply swapping two variables. Division-free algorithms theoretically enhance the precision since they avoid the division round-off error and thus increase the numerical stability of the algorithm (Peng et al., 1996). The proposed algorithm replaces the divisions of the Gauss-Jordan algorithm by error-free scalings. We would have liked to measure an increase in
the precision. Nevertheless, in our experiments with double precision arithmetic on random systems, we have not been able to observe a significant difference between the precision of the proposed approach and the precision of the standard Gauss-Jordan algorithm.

8. Conclusion

We have proposed a division free algorithm and a corresponding parallel and pipelined architecture, based on the well-known Gauss-Jordan algorithm, to solve linear systems at low latency. Results show a two-order-of-magnitude improvement compared to previous works, since a hundred-equation system can be solved in less than ten microseconds with a state of the art FPGA in single or double precision arithmetic. We have also proposed an improvement of the architecture to support pivoting. When pivoting is enabled, the latency overhead varies from 2.5% to 50%, depending on the configuration of the architecture and the size of the system. This work confirms that, today, FPGAs are powerful targets for highly parallel applications involving floating point arithmetic in single or double precision.
References

Altera, 2014. Floating-point ip cores user guide. URL https://www.altera.com
Arias-Garcia, J., Braga, A., Llanos, C., Ayala-Rincon, M., Pezzuol Jacobi, R., Foltran, A., Feb 2013. Fpga hil simulation of a linear system block for strongly coupled system applications. In: Industrial Technology (ICIT), 2013 IEEE International Conference on. pp. 1017–1022.
Arias Garcia, J., Llanos, C., Ayala Rincon, M., Jacobi, R., Feb 2012a. A fast and low cost architecture developed in fpgas for solving systems of linear equations. In: Circuits and Systems (LASCAS), 2012 IEEE Third Latin American Symposium on. pp. 1–4.
Arias Garcia, J., Llanos, C., Ayala Rincon, M., Jacobi, R., March 2012b. Fpga implementation of large-scale matrix inversion using single, double and custom floating-point precision. In: Programmable Logic (SPL), 2012 VIII Southern Conference on. pp. 1–6.
Bareiss, E. H., 1968. Sylvester's identity and multistep integer-preserving gaussian elimination. Mathematics of Computation 22 (103), 565–578.
Bermond, J.-C., Peyrat, C., Sakho, I., Tchuenté, M., Feb. 1996. Parallelization of the gaussian elimination algorithm on systolic arrays. J. Parallel Distrib. Comput. 33 (1), 69–75. URL http://dx.doi.org/10.1006/jpdc.1996.0025
Böhm, W., Hammes, J., 2004. A transformational approach to high performance embedded computing. Tech. rep., DTIC Document.
Chen, Y., Dinavahi, V., July 2011. An iterative real-time nonlinear electromagnetic transient solver on fpga. In: Power and Energy Society General Meeting, 2011 IEEE. pp. 1–8.
Chowdhury, R. A., Ramachandran, V., Silvestri, F., Blakeley, B., 2013. Oblivious algorithms for multicores and networks of processors. Journal of Parallel and Distributed Computing 73 (7), 911–925.
Daga, V., Govindu, G., Prasanna, V., Gangadharpalli, S., Sridhar, V., 2004. Efficient floating-point based block lu decomposition on fpgas. In: Proceedings of the 2004 International Conference on Engineering Reconfigurable Systems and Algorithms. pp. 276–279.
David, J. P., Dec 2015. Low latency solver for linear equation systems in floating point arithmetic. In: 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig). pp. 1–7.
de Matos, G., Neto, H., Aug 2006. On reconfigurable architectures for efficient matrix inversion. In: Field Programmable Logic and Applications, 2006. FPL '06. International Conference on. pp. 1–6.
de Matos, G., Neto, H., Feb 2007. Memory optimized architecture for efficient gauss-jordan matrix inversion. In: Programmable Logic, 2007. SPL '07. 2007 3rd Southern Conference on. pp. 33–38.
Duarte, R., Neto, H., Vestias, M., Aug 2009. Double-precision gauss-jordan algorithm with partial pivoting on fpgas. In: Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on. pp. 273–280.
Dufour, C., Cense, S., Ould-Bachir, T., Gregoire, L., Belanger, J., Oct 2012. General-purpose reconfigurable low-latency electric circuit and motor drive solver on fpga. In: IECON 2012 - 38th Annual Conference on IEEE Industrial Electronics Society. pp. 3073–3081.
Faverge, M., Herrmann, J., Langou, J., Lowery, B., Robert, Y., Dongarra, J., 2015. Mixing lu and qr factorization algorithms to design high-performance dense linear algebra solvers. Journal of Parallel and Distributed Computing 85, 32–46.
Govindu, G., Scrofano, R., Prasanna, V. K., 2005. A library of parameterizable floating-point cores for fpgas and their application to scientific computing. In: Proc. of International Conference on Engineering Reconfigurable Systems and Algorithms. pp. 137–148.
Johnson, J., Nagvajara, P., Nwankpa, C., 2004. High-performance linear algebra processor using fpga. Tech. rep., DTIC Document.
Kirsch, B. J., Turner, P. R., Jun 1993. Adaptive beamforming using rns arithmetic. In: Computer Arithmetic, 1993. Proceedings., 11th Symposium on. pp. 36–43.
Kirsch, B. J., Turner, P. R., 1994. Modified gaussian elimination for adaptive beam forming using rns arithmetic. Tech. rep., DTIC Document.
Lienhart, G., Kugel, A., Manner, R., 2002. Using floating-point arithmetic on fpgas to accelerate scientific n-body simulations. In: Field-Programmable Custom Computing Machines, 2002. Proceedings. 10th Annual IEEE Symposium on. pp. 182–191.
Martinez-Alonso, R., Mino, K., Torres-Lucio, D., Sept 2010. Array processors designed with vhdl for solution of linear equation systems implemented in a fpga. In: Electronics, Robotics and Automotive Mechanics Conference (CERMA), 2010. pp. 731–736.
Moussa, S., Razik, A., Dahmane, A., Hamam, H., May 2013. Fpga implementation of floating-point complex matrix inversion based on gauss-jordan elimination. In: Electrical and Computer Engineering (CCECE), 2013 26th Annual IEEE Canadian Conference on. pp. 1–4.
Ould-Bachir, T., Blanchette, H. F., Al-Haddad, K., June 2015. A network tearing technique for fpga-based real-time simulation of power converters. IEEE Transactions on Industrial Electronics 62 (6), 3409–3418.
Ould-Bachir, T., Dufour, C., Belanger, J., Mahseredjian, J., David, J.-P., May 2012. Effective floating-point calculation engines intended for the fpga-based hil simulation. In: Industrial Electronics (ISIE), 2012 IEEE International Symposium on. pp. 1363–1368.
Peng, S., Sedukhin, S., Sedukhin, I., Aug 1996. Parallel algorithm and architecture for two-step division-free gaussian elimination. In: Application Specific Systems, Architectures and Processors, 1996. ASAP 96. Proceedings of International Conference on. pp. 183–192.
Prasanna, V., Morris, G., March 2007. Sparse matrix computations on reconfigurable hardware. Computer 40 (3), 58–64.
Shen, K., Crossley, J., Lun, A., Liu, H., 1999. The Nine Chapters on the Mathematical Art: Companion and Commentary. Oxford University Press. URL https://books.google.ca/books?id=eiTJHRGTG6YC
Zhang, W., Betz, V., Rose, J., Mar. 2012. Portable and scalable fpga-based acceleration of a direct linear system solver. ACM Trans. Reconfigurable Technol. Syst. 5 (1), 6:1–6:26. URL http://doi.acm.org/10.1145/2133352.2133358
Zhuo, L., Prasanna, V. K., Aug 2008. High-performance designs for linear algebra operations on reconfigurable hardware. IEEE Transactions on Computers 57 (8), 1057–1071.
Ziad, M. T. I., Alkabani, Y., El-Kharashi, M. W., Aug 2015. On hardware solution of dense linear systems via gauss-jordan elimination. In: Communications, Computers and Signal Processing (PACRIM), 2015 IEEE Pacific Rim Conference on. pp. 364–369.
Algorithm 1: Gauss-Jordan algorithm
Data: Matrices A(n × n) and b(n × 1)
Result: C = A|b is transformed so that A is the identity and still Ax = b
 1  for k = 1 to n do
 2      if (pivoting) then
 3          Find Ck-1(m, k) | ∀n ≠ m, |Ck-1(n, k)| < |Ck-1(m, k)|
 4          Swap rows k and m
 5      end
 6      M = Ck-1(k, k)
 7      for i = 1 to n do
 8          L = Ck-1(i, k)
 9          if (k == i) then
10              // Make Ck(i, k) one
11              Ck(k, ...) = (1/M) * Ck-1(k, ...)
12          else
13              // Make Ck(i, k) zero
14              Ck(i, ...) = Ck-1(i, ...) - (L/M) * Ck-1(k, ...)
15          end
16      end
17  end
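For readers who prefer an executable form, the following minimal NumPy sketch mirrors Algorithm 1 (it is an illustrative software model only, not the hardware of Section 5). Row k is normalized after the other rows have been updated so that every update reads the pre-step values, as in the algorithm.

import numpy as np

def gauss_jordan_solve(A, b):
    """Software model of Algorithm 1: Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    C = np.hstack([A.astype(float), b.astype(float).reshape(n, 1)])  # C = A|b
    for k in range(n):
        # Pivoting: only rows that have not yet been pivot rows are eligible
        m = k + int(np.argmax(np.abs(C[k:, k])))
        C[[k, m]] = C[[m, k]]
        M = C[k, k]
        for i in range(n):
            if i != k:
                C[i, :] -= (C[i, k] / M) * C[k, :]   # make C(i, k) zero
        C[k, :] /= M                                 # make C(k, k) one
    return C[:, n]                                   # A is now the identity

rng = np.random.default_rng(0)
A, b = rng.standard_normal((8, 8)), rng.standard_normal(8)
assert np.allclose(A @ gauss_jordan_solve(A, b), b)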
Algorithm 2: The division-free algorithm (Martinez-Alonso et al., 2010)
Data: Matrices A(n × n) and b(n × 1)
Result: C = A|b is transformed so that A is diagonal and still Ax = b
 1  for k = 1 to n do
 2      if (pivoting) then
 3          Find Ck-1(m, k) | ∀n ≠ m, |Ck-1(n, k)| < |Ck-1(m, k)|
 4          Swap rows k and m
 5      end
 6      M = Ck-1(k, k)
 7      for i = 1 to n do
 8          L = Ck-1(i, k)
 9          if (k == i) then
10              // row k is not modified
11              Ck(k, ...) = Ck-1(k, ...)
12          else
13              // Make Ck(i, k) zero
14              Ck(i, ...) = M * Ck-1(i, ...) - L * Ck-1(k, ...)
15          end
16      end
17  end
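The magnitude problem discussed in Section 3.2 is easy to reproduce in software. The short Python experiment below (illustrative only; pivoting is omitted and the matrix is random) runs the plain Algorithm 2 update in double precision and prints the largest binary exponent in C after each step: it roughly doubles at every step and, on this kind of data, overflows after about ten steps, which is why Algorithm 2 alone cannot handle systems of useful size.

import numpy as np

np.seterr(over="ignore")  # we expect to overflow; that is the point of the experiment
rng = np.random.default_rng(1)
n = 16
C = np.hstack([rng.standard_normal((n, n)), rng.standard_normal((n, 1))])
for k in range(n):
    M = C[k, k]
    for i in range(n):
        if i != k:
            C[i, :] = M * C[i, :] - C[i, k] * C[k, :]   # Algorithm 2 update, no scaling
    top = np.abs(C).max()
    if not np.isfinite(top):
        print(f"overflow in double precision at step {k + 1}")
        break
    print(f"step {k + 1}: largest binary exponent = {np.frexp(top)[1]}")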
Algorithm 3: Proposed algorithm
Data: Matrices A(n × n) and b(n × 1)
Result: C = A|b is transformed so that A is diagonal and still Ax = b
 1  Initialize C0 = A|b
 2  for k = 1 to n do
 3      if (pivoting) then
 4          Find Ck-1(m, k) | ∀n ≠ m, |Ck-1(n, k)| < |Ck-1(m, k)|
 5          Swap rows k and m
 6      end
 7      M = Ck-1(k, k)
 8      for i = 1 to n do
 9          L = Ck-1(i, k)
10          if (k == i) then
11              // row k is not modified
12              Ck(k, ...) = Ck-1(k, ...)
13          else
14              // Make Ck(i, k) zero
15              Ck(i, ...) = M * Ck-1(i, ...) - L * Ck-1(k, ...)
16              Ck(i, ...) = (1 / 2^x estimate of M) * Ck(i, ...)
17          end
18      end
19  end
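A compact software model of Algorithm 3 is sketched below (again purely illustrative: it uses math.frexp to obtain the power-of-two estimate of M, whereas the hardware simply subtracts exponent fields). Because each row of A|b is only ever multiplied by non-zero constants, the solution is unchanged, A ends up diagonal, and x is recovered with one division per row.

import numpy as np
from math import frexp

def division_free_solve(A, b):
    """Software model of Algorithm 3: division-free steps with power-of-two rescaling."""
    n = len(b)
    C = np.hstack([A.astype(float), b.astype(float).reshape(n, 1)])
    for k in range(n):
        # partial pivoting restricted to rows that have not been pivot rows yet
        m = k + int(np.argmax(np.abs(C[k:, k])))
        C[[k, m]] = C[[m, k]]
        M = C[k, k]
        scale = 2.0 ** frexp(M)[1]        # power-of-two estimate of M: an error-free divisor
        for i in range(n):
            if i != k:                    # row k is not modified
                C[i, :] = (M * C[i, :] - C[i, k] * C[k, :]) / scale
    return C[:, n] / np.diag(C[:, :n])    # A is diagonal: one final division per unknown

rng = np.random.default_rng(2)
A, b = rng.standard_normal((50, 50)), rng.standard_normal(50)
assert np.allclose(A @ division_free_solve(A, b), b)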
Algorithm 4: Proposed simplified algorithm
Data: Matrices A(n × n) and b(n × 1)
Result: C = A|b|d where d(i)x(i) = b(i)
 1  Initialize C0 = A|b|0
 2  for k = 1 to n do
 3      if (pivoting) then
 4          Find Ck-1(m, k) | ∀n ≠ m, |Ck-1(n, k)| < |Ck-1(m, k)|
 5          Swap rows k and m
 6      end
 7      M = Ck-1(k, k)
 8      for i = 1 to n do
 9          L = Ck-1(i, k)
10          if (k == i) then
11              // row k is not modified
12              Ck(k, k+1 ... n+1) = Ck-1(k, k+1 ... n+1)
13              Ck(k, n+2) = Ck-1(k, k)
14          else
15              // Make Ck(i, k) zero
16              Ck(i, k+1 ... n+2) = M * Ck-1(i, k+1 ... n+2) - L * Ck-1(k, k+1 ... n+2)
17              Ck(i, k+1 ... n+2) = (1 / 2^x estimate of M) * Ck(i, k+1 ... n+2)
18          end
19      end
20  end
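The scaling of line 17 is error free because a division by a power of two only changes the exponent field of an IEEE-754 number. The small experiment below (a Python illustration of the idea, not the VHDL implementation) divides a double by 2^e by subtracting e directly from the 11-bit exponent field and checks that the result is bit-for-bit identical to the regular floating point division; it assumes normalized numbers and no exponent under/overflow.

import struct

def div_pow2_via_exponent(x: float, e: int) -> float:
    """Divide a normalized double by 2**e by subtracting e from its exponent field."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exp_field = (bits >> 52) & 0x7FF                       # 11-bit biased exponent
    bits = (bits & ~(0x7FF << 52)) | ((exp_field - e) << 52)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

for x in (3.141592653589793, -123456.789, 1e-30):
    for e in (1, 7, 42):
        assert div_pow2_via_exponent(x, e) == x / 2.0 ** e   # exact, no rounding error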
Algorithm 5: Evaluation of the total latency
Data: n: the size of the system
Data: Lpo: the latency of the PMSS operator
Data: Lpivot: the latency to find the pivot row
Data: Lrouting: the latency to route the data from the master PMSS to the slave PMSS
Data: P: the number of PMSS per RPE
Result: nb_cycles: the total number of clock cycles to make the system diagonal
 1  LPR = Lpivot + Lrouting
 2  // First step
 3  nb_cycles = LPR - 1 + max(Lpo + LPR, ⌈n/P⌉)
 4  // Internal steps
 5  for k = 2 to n-1 do
 6      nb_cycles += max(Lpo + LPR, ⌈(n + 2 - k)/P⌉)
 7  end
 8  // Last step
 9  if (P == 1) then
10      nb_cycles += Lpo + 1
11  else
12      nb_cycles += Lpo
13  end
14  return nb_cycles
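A direct Python transcription of Algorithm 5 is given below; the sample checks reproduce the n = 20 worked example of Section 4.2 and a few entries of Tables 1 and 4 (the parameter values used in the asserts are the ones assumed by those tables).

from math import ceil, log2, log10

def total_cycles(n, Lpo, Lpivot, Lrouting, P):
    """Algorithm 5: total clock cycles to make the system diagonal."""
    LPR = Lpivot + Lrouting
    cycles = LPR - 1 + max(Lpo + LPR, ceil(n / P))        # first step
    for k in range(2, n):                                 # internal steps k = 2 .. n-1
        cycles += max(Lpo + LPR, ceil((n + 2 - k) / P))
    cycles += Lpo + 1 if P == 1 else Lpo                  # last step
    return cycles

# Without pivoting: Lpivot = 1 and the routing latency is folded into Lpo (cf. Table 1).
assert total_cycles(20, 13, 1, 0, 1) == 307
assert total_cycles(100, 13, 1, 0, 1) == 5227
# With pivoting: Lpivot = ceil(log2(n)), Lrouting = ceil(log10(n)) (cf. Table 4).
assert total_cycles(20, 13, ceil(log2(20)), ceil(log10(20)), 1) == 400
assert total_cycles(100, 13, ceil(log2(100)), ceil(log10(100)), 1) == 5359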
Table 1: Total latency for P=1 (clock cycles)

n     Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
4     56      68      80      92      104     116     128
8     112     136     160     184     208     232     256
12    168     204     240     276     312     348     384
16    229     272     320     368     416     464     512
20    307     349     400     460     520     580     640
40    937     979     1030    1090    1159    1237    1324
60    1967    2009    2060    2120    2189    2267    2354
80    3397    3439    3490    3550    3619    3697    3784
100   5227    5269    5320    5380    5449    5527    5614
Table 2: Total latency for P=2 (clock cycles)

n     Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
4     55      67      79      91      103     115     127
8     111     135     159     183     207     231     255
12    167     203     239     275     311     347     383
16    223     271     319     367     415     463     511
20    279     339     399     459     519     579     639
40    607     694     799     919     1039    1159    1279
60    1127    1214    1319    1442    1583    1742    1919
80    1847    1934    2039    2162    2303    2462    2639
100   2767    2854    2959    3082    3223    3382    3559
Table 3: Total latency for P=4 (clock cycles)

n     Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
4     55      67      79      91      103     115     127
8     111     135     159     183     207     231     255
12    167     203     239     275     311     347     383
16    223     271     319     367     415     463     511
20    279     339     399     459     519     579     639
40    559     679     799     919     1039    1159    1279
60    844     1019    1199    1379    1559    1739    1919
80    1209    1386    1599    1839    2079    2319    2559
100   1674    1851    2064    2313    2599    2899    3199
Table 4: Total latency for P=1 (clock cycles), with pivoting

n     Lpivoting  Lrouting  Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
4     2          1         64      76      88      100     112     124     136
8     3          1         136     160     184     208     232     256     280
12    4          2         228     264     300     336     372     408     444
16    4          2         304     352     400     448     496     544     592
20    5          2         400     460     520     580     640     700     760
40    6          2         1049    1112    1184    1265    1355    1454    1562
60    6          2         2079    2142    2214    2295    2385    2484    2592
80    7          2         3529    3595    3670    3754    3847    3949    4060
100   7          2         5359    5425    5500    5584    5677    5779    5890
Table 5: Total latency for P=2 (clock cycles), with pivoting

n     Lpivoting  Lrouting  Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
4     2          1         63      75      87      99      111     123     135
8     3          1         135     159     183     207     231     255     279
12    4          2         227     263     299     335     371     407     443
16    4          2         303     351     399     447     495     543     591
20    5          2         399     459     519     579     639     699     759
40    6          2         839     959     1079    1199    1319    1439    1559
60    6          2         1358    1487    1634    1799    1979    2159    2339
80    7          2         2119    2254    2407    2578    2767    2974    3199
100   7          2         3039    3174    3327    3498    3687    3894    4119
Table 6: Total latency for P=4 (clock cycles), with pivoting

n     Lpivoting  Lrouting  Lpo=13  Lpo=16  Lpo=19  Lpo=22  Lpo=25  Lpo=28  Lpo=31
4     2          1         63      75      87      99      111     123     135
8     3          1         135     159     183     207     231     255     279
12    4          2         227     263     299     335     371     407     443
16    4          2         303     351     399     447     495     543     591
20    5          2         399     459     519     579     639     699     759
40    6          2         839     959     1079    1199    1319    1439    1559
60    6          2         1259    1439    1619    1799    1979    2159    2339
80    7          2         1759    1999    2239    2479    2719    2959    3199
100   7          2         2226    2499    2799    3099    3399    3699    3999
Table 7: Multiplier synthesis results (Freq in MHz)

          Single precision               Double precision
Stages    Freq   ALM    Reg.   DSP       Freq   ALM    Reg.   DSP
5         444    94     190    1         292    181    296    4
6         446    103    252    1         339    205    474    4
10        451    145    457    1         367    293    862    4
11        460    155    493    1         364    310    936    4
Table 8: Subtractor synthesis results (Freq in MHz)

          Single precision               Double precision
Stages    Freq   ALM    Reg.   DSP       Freq   ALM    Reg.   DSP
7         400    364    401    0         370    739    740    0
11        404    354    744    0         364    713    1372   0
12        550    363    788    0         466    726    1452   0
Table 9: PMSS operator synthesis results (Freq in MHz)

Single precision
Config   Freq   ALM    Reg.   DSP   Latency
5-7      256    520    943    2     13
5-12     439    565    1254   2     18

Double precision
Config   Freq   ALM    Reg.   DSP   Latency
5-7      205    1105   1569   8     13
6-7      210    1122   1813   8     14
10-7     207    1255   2592   8     18
Table 10: Total latency (us) without pivoting

        P=1                                P=2                                P=4
Size    single    single     double        single    single     double        single    single     double
        5-7 (13)  5-12 (18)  5-7 (13)      5-7 (13)  5-12 (18)  5-7 (13)      5-7 (13)  5-12 (18)  5-7 (13)
4       0.219     0.173      0.273         0.215     0.171      0.268         0.215     0.171      0.268
8       0.437     0.347      0.546         0.433     0.344      0.541         0.433     0.344      0.541
12      0.656     0.520      0.819         0.652     0.518      0.814         0.652     0.518      0.814
16      0.894     0.693      1.117         0.870     0.691      1.088         0.870     0.691      1.088
20      1.198     0.871      1.497         1.089     0.864      1.361         1.089     0.864      1.361
40      3.657     2.307      4.570         2.369     1.737      2.960         2.182     1.731      2.726
60      7.677     4.656      9.593         4.399     2.923      5.496         3.294     2.597      4.116
80      13.259    7.916      16.567        7.209     4.565      9.008         4.719     3.475      5.896
100     20.401    12.088     25.493        10.800    6.662      13.495        6.534     4.535      8.164
Table 11: Total latency (us) with pivoting

        P=1                                P=2                                P=4
Size    single    single     double        single    single     double        single    single     double
        5-7 (13)  5-12 (18)  5-7 (13)      5-7 (13)  5-12 (18)  5-7 (13)      5-7 (13)  5-12 (18)  5-7 (13)
4       0.250     0.192      0.312         0.246     0.189      0.307         0.246     0.189      0.307
8       0.531     0.401      0.663         0.527     0.399      0.658         0.527     0.399      0.658
12      0.890     0.657      1.112         0.886     0.654      1.107         0.886     0.654      1.107
16      1.187     0.876      1.483         1.183     0.873      1.478         1.183     0.873      1.478
20      1.561     1.140      1.951         1.557     1.138      1.946         1.557     1.138      1.946
40      4.094     2.642      5.116         3.275     2.369      4.092         3.275     2.369      4.092
60      8.114     4.991      10.139        5.300     3.609      6.623         4.914     3.554      6.140
80      13.774    8.308      17.211        8.271     5.367      10.335        6.865     4.922      8.579
100     20.916    12.481     26.136        11.861    7.465      14.821        8.688     6.154      10.856
Table 12: Synthesis results of the complete system

n     Lpivoting  Lrouting  Freq    ALM     Registers  Memory bits  DSP blocks
4     5          2         204.5   3026    5417       1024         8
8     6          2         196.77  6122    101778     4096         16
12    6          2         219.83  9092    16107      6936         24
16    7          2         219.39  12303   21570      17440        32
20    7          2         219.25  14986   26920      21800        40
40    8          2         214.55  30039   53957      84560        80
60    8          2         195.01  45379   80282      126840       120
80    9          2         182.98  61110   106914     332960       160
100   9          2         177.81  76363   133597     416200       200
Author Biography & Photograph
Jean Pierre David received the degree in electrical engineering from Université de Liège, Belgium, in 1995 and the Ph.D. degree from Université Catholique de Louvain, Louvain-la-Neuve, Belgium, in 2002. He was an Assistant Professor with Université de Montréal, QC, Canada, for three years and joined Polytechnique Montréal, QC, Canada, as an Assistant Professor in 2006. He has been an Associate Professor there since 2013. His research interests include digital system design, reconfigurable computing, high-level synthesis and their applications.