Low-power pipelined LMS adaptive filter architectures with minimal adaptation delay

S. Ramanathan*, V. Visvanathan
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India

INTEGRATION, the VLSI journal 27 (1999) 1-32
Received 11 May 1998; accepted 4 June 1998

Abstract

The use of delayed coefficient adaptation in the least mean square (LMS) algorithm has enabled the design of pipelined architectures for real-time transversal adaptive filtering. However, the convergence speed of this delayed LMS (DLMS) algorithm, when compared with that of the standard LMS algorithm, is degraded, and worsens with increase in the adaptation delay. Existing pipelined DLMS architectures have large adaptation delay and hence degraded convergence speed. In the first part of this paper, we present a pipelined DLMS architecture with minimal adaptation delay for any given sampling rate. The architecture is synthesized by applying a number of function preserving transformations to the signal flow graph representation of the DLMS algorithm. With the use of carry-save arithmetic, the pipelined architecture can support high sampling rates, limited only by the delay of a full adder and a 2-to-1 multiplexer. In the second part of this paper, we extend the synthesis methodology described in the first part to synthesize pipelined DLMS architectures whose power dissipation meets a specified budget. This low-power architecture exploits the parallelism in the DLMS algorithm to meet the required computational throughput. The architecture exhibits a novel tradeoff between algorithmic performance (convergence speed) and power dissipation. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Adaptive filtering; Least mean squares algorithm; Low power; Pipelined architectures; Systolic architectures; Configurable processor array

1. Introduction

The least mean squares (LMS) algorithm for transversal adaptive filtering [1,2] has been extremely popular and successful for diverse applications in areas like communications, control, and signal processing. This is due to the algorithm's low computational complexity and stable convergence behaviour. However, this algorithm is not amenable to pipelining [3], a technique indispensable in the design of low-power or high-speed architectures with minimal area penalty [4].

* Corresponding author. E-mail: [email protected]. This paper is an extended version of results presented at VLSI Design'96 [17] and VLSI Design'97 [22].


The problem stems from the absence of a sufficient number of delays (registers) in the error feedback loop, which provides the filter's prediction error to all the taps in order to update the filter coefficients. This problem can be overcome by the use of delayed coefficient adaptation in the LMS algorithm [3], which has been an important contribution towards developing pipelined LMS adaptive filter architectures. The basic idea is to update the filter coefficients using a delayed value of the error, which provides the required registers in the error feedback loop to pipeline the filter architecture. However, as has been documented in [3], the convergence speed of this delayed LMS (DLMS) algorithm degrades progressively with increase in the adaptation delay. This degradation of convergence speed is of particular concern in a nonstationary environment, as it could lead to a loss of tracking capability. Long et al. [3] therefore recommend that "every effort should still be made to keep the delay as small as possible if it is not avoidable". In the context of a real-time environment, which is the concern of this paper, this implies that a good architectural synthesis methodology is one that uses very few adaptation delays for a slow sampling rate application (e.g. speech) while requiring a larger number of delays for higher sampling rates (e.g. video).

Based on the DLMS algorithm, many researchers have proposed pipelined architectures [5-16]. These architectures either support a high sampling rate or exhibit improved convergence performance, but each is achieved at the cost of the other and/or area. Further, most of these architectures suffer from large latency. In the first part of this paper, we present a synthesis methodology to derive pipelined DLMS architectures that overcome the above-mentioned problems for any given sampling rate [17]. The proposed architecture thus has better convergence speed, and its input/output latency is a small constant independent of the filter order. These improvements are illustrated through examples. They have been achieved by using the associativity of addition in a novel manner and by exploiting the slowness of the sampling rate with respect to the hardware speed.

The above ideas are encapsulated in the transformation-based architecture synthesis methodology followed in this paper, wherein a sequence of transformations is applied to the signal flow graph representation of the DLMS algorithm to arrive at the final pipelined architecture. The standard transformations of holdup, associativity of addition, retiming [18], slowdown and folding [19] are used, all of which preserve the functionality of the original algorithm; hence the final architecture is correct by construction. The use of such transformations for optimizing area and/or speed is well known (see [20] for an excellent review). Recently, they have also been used to minimize power [21]. However, this is the first instance of their use in improving algorithmic robustness. Apart from the standard transformations mentioned above, the methodology also uses some special transformations which are presented in this paper and proved to be function preserving. Using carry-save arithmetic, the critical path of the new architecture consists of a full adder and a 2-to-1 multiplexer. The resulting fine-grain pipelined architecture with minimal adaptation delay can be used either for high sampling rate or for low-power applications. It therefore appears that the use of sum relaxation [7] or vector processing [8] techniques, with their attendant area penalty and convergence degradation, is not necessary for high sampling rate or low-power applications.

In the second part of the paper, we extend the synthesis methodology described in the first part to synthesize pipelined architectures with minimal adaptation delay subject to a power constraint [22]. In order to meet the given power constraint, the supply voltage is treated as a free variable.


The resulting low-power architecture exploits the parallelism in the DLMS algorithm to meet the required computational throughput. An interesting and novel tradeoff between algorithmic performance and power dissipation is exhibited by this architecture and is illustrated through examples.

The organization of the paper is as follows. In Section 2, we review the DLMS algorithm and highlight the relationship between convergence speed and adaptation delay through system identification and adaptive line enhancer examples. Further, we review the existing pipelined architectures [5-16]. In Section 3, we present the synthesis methodology to derive pipelined DLMS architectures with minimal adaptation delay for any given sampling rate. The performance improvement of the proposed architecture with respect to the existing pipelined architectures is highlighted through examples. The fine-grain pipelined architecture derived using carry-save arithmetic is also discussed. In Section 4, we extend the synthesis methodology presented in Section 3 to incorporate a power constraint. We also discuss the novel tradeoff between convergence speed and power dissipation exhibited by the resulting low-power architecture and substantiate this tradeoff through examples. We summarize the work in Section 5.

2. The DLMS algorithm and existing pipelined architectures

The DLMS adaptive filtering algorithm [3] is given by

y(n) = \sum_{k=0}^{N-1} a_k(n) x(n-k),   (1)

a_k(n+1) = a_k(n) + \mu e(n - D_A) x(n - k - D_A),   0 ≤ k ≤ (N-1),   (2)

e(n) = d(n) - y(n),   (3)

where N is the filter order, D_A is the number of input sample periods by which the adaptation is delayed (referred to as the adaptation delay), \mu is the adaptation step size, {x(n)} is the input sequence, {d(n)} is the desired response sequence, {y(n)} is the output sequence, {a_k(n)}, 0 ≤ k ≤ (N-1), is the set of time-varying filter coefficients, {e(n)} is the adaptation error sequence, and n ∈ Z, the set of integers.
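As a concrete reference model, the recursion of Eqs. (1)-(3) can be prototyped directly; the following Python sketch is ours (names and structure are illustrative, not from the paper):

    import numpy as np

    def dlms(x, d, N, mu, DA, n_iters):
        # Reference model of Eqs. (1)-(3): order-N DLMS filter, adaptation delay DA.
        a = np.zeros(N)         # time-varying coefficients a_k(n)
        e = np.zeros(n_iters)   # adaptation errors e(n); pre-start errors stay zero
        for n in range(N + DA, n_iters):
            xn = x[n - N + 1:n + 1][::-1]            # [x(n), x(n-1), ..., x(n-N+1)]
            e[n] = d[n] - a @ xn                     # Eqs. (1) and (3)
            xd = x[n - DA - N + 1:n - DA + 1][::-1]  # delayed regressor x(n-k-DA)
            a = a + mu * e[n - DA] * xd              # Eq. (2): delayed update
        return a, e

Setting DA = 0 recovers the standard LMS recursion, which makes the effect of the adaptation delay easy to probe empirically.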

Fig. 1 shows the standard signal-flow-graph (SFG) representation of the DLMS adaptive filtering algorithm. The D_A registers present at the inputs to the coefficient-update block are used to pipeline the adaptive filter architecture, thereby alleviating the problems faced by the LMS algorithm. However, the convergence speed of the DLMS algorithm is degraded when compared with that of the LMS algorithm [3]. This degradation worsens with increase in D_A, as illustrated by the examples described below.

Example 1(a). The task is to identify the coefficients of an FIR filter of order 128 by using a DLMS adaptive filter of the same order. The input is a first-order auto-regressive process (AR(1): x(n) = ρ x(n-1) + w(n), where w(n) is a white noise process) of unit power. The measurement noise, noise(n), is an independent additive zero-mean white noise of low power. Fig. 2 shows the simulation results obtained by averaging over 500 runs with D_A = 0, 8, 16, 32, 64, and 128, for ρ = 0.1.


Fig. 1. Standard signal-flow-graph representation of the DLMS adaptive filter.

Fig. 2. Convergence plots for Example 1(a).


Fig. 3. Convergence plots for Example 1(b).

Note that, to obtain these mean-squared-error convergence plots, the step-size for each case was chosen to get the fastest convergence while ensuring that the misadjustment was below 100%. Progressive degradation in the convergence speed with increase in D_A can be observed, which demonstrates the relevance of minimizing D_A.

Example 1(b). The task is to enhance the power of sinusoids corrupted by broadband noise. The input to the 64-tap adaptive line enhancer (ALE) consists of two sine waves, of power and frequency σ_1^2 = 0.5, f_1 = 1 MHz and σ_2^2 = 0.5, f_2 = 2.5 MHz, corrupted by independent broadband additive Gaussian noise of power σ^2 = 0.1. The sampling frequency is 10 MHz, the decorrelation delay is 1, and the Wiener error for this setting is 0.1065. Fig. 3 shows the simulation results obtained by averaging over 1000 runs with D_A = 0, 8, 16, 32, 64. Note that, to obtain these mean-squared-error convergence plots, the step-size for each case was chosen to get the fastest convergence while ensuring that the misadjustment was below 30%. Progressive degradation in the convergence speed with increase in D_A can be observed, which once again demonstrates the relevance of minimizing D_A.
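A minimal driver in the spirit of Example 1(a) can be built on the dlms() sketch above; all numeric values below are illustrative assumptions rather than the paper's exact simulation settings:

    import numpy as np

    rng = np.random.default_rng(0)
    N, mu, rho, n_iters = 128, 0.002, 0.1, 30000
    w = rng.standard_normal(n_iters)
    x = np.zeros(n_iters)
    for n in range(1, n_iters):                      # unit-power AR(1) input
        x[n] = rho * x[n - 1] + np.sqrt(1 - rho**2) * w[n]
    a_true = rng.standard_normal(N) / np.sqrt(N)     # unknown FIR system to identify
    d = np.convolve(x, a_true)[:n_iters] + 1e-2 * rng.standard_normal(n_iters)
    for DA in (0, 8, 32, 128):                       # convergence slows as DA grows
        _, e = dlms(x, d, N, mu, DA, n_iters)
        print(DA, float(np.mean(e[-3000:] ** 2)))    # rough steady-state MSE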


A number of pipelined architectures have been proposed for the DLMS algorithm [5-16]. These architectures can be classified broadly into two categories, as described below.

(A) Pipelined architectures based on the standard DLMS algorithm [3]. Herzberg et al. [5] and Meyer et al. [6] have presented architectures with large adaptation delays of (N-1) and N, respectively, to support sampling rates of 1/[3(T_M + T_A)] and 1/[2(T_M + T_A)]. The architectures proposed by Shanbhag et al. [7] and Meyer et al. [8], using the techniques of relaxed look-ahead and vectorization, respectively, to support high sampling rates, have large adaptation delays (> N) and large hardware overhead. All of these bit-parallel architectures suffer from degraded convergence behaviour due to the use of large adaptation delays. Further, they have large latency. Preliminary attempts to reduce adaptation delays in bit-parallel architectures for improved convergence behaviour have been reported by Visvanathan et al. [9] and Thomas [10]. However, these architectures succeed in reducing the adaptation delay only to N/2 while supporting a sampling rate of 1/[2(T_M + T_A)]. The latency of these architectures is minimal and is independent of the filter order. Wang et al. [11,12] also attempt to reduce the adaptation delays through the use of bit-serial and digit-serial architectures. However, these are not suitable for high sampling rate applications.

(B) Pipelined architectures based on the modified DLMS algorithm [13]. Zhu et al. [14] have presented a bit-parallel architecture based on the modified DLMS algorithm that counters the effect of the adaptation delay and hence has a convergence behaviour similar to that of the standard full-band LMS algorithm. This architecture can support a sampling rate of 1/[2(T_M + T_A)] with a computational overhead of 3N multiplications and N additions. Also, the architecture exhibits large latency. Matsubara et al. [15,16] have also reported a bit-parallel architecture based on the modified DLMS algorithm. It provides control over the adaptation delay, and hence over the convergence behaviour, at the cost of increased computational requirements. It can support a sampling rate of 1/(T_M + T_A) with a computational overhead of 2d multiplications and d additions, where 0 ≤ d ≤ D_A. The effective adaptation delay of this architecture is (D_A - d). The case d = 0 corresponds to the DLMS algorithm with D_A = N, and the case d = D_A = N corresponds to the modified DLMS algorithm. This architecture too exhibits large latency.

In sum, all the existing architectures [5-16] either support a high sampling rate or exhibit improved convergence performance, each achieved at the cost of the other and/or area. Further, most of these architectures suffer from large latency.

3. Synthesis of pipelined DLMS architectures

In this section we synthesize pipelined DLMS adaptive filter architectures with minimal adaptation delay and minimal latency for any given sampling rate. Fig. 4 depicts the standard SFG representation of the DLMS adaptive filter, whose inputs are delayed by D_H holdup delays. A sequence of function preserving transformations is applied to this SFG to obtain the pipelined DLMS architecture. Using associativity of addition, we reverse the direction of the output-accumulation path as shown in Fig. 5. Note that the error-broadcast path has also been reversed. These two paths can now be broken after every filter tap by retiming of the cutsets indicated in the figure, wherein each retiming step involves removal of a single delay from each of the inputs to a cutset and addition of a single delay to each of its outputs. This requires the use of the adaptation delay present in the error-feedback loop.

T_M and T_A indicate the delay of a multiplier and an adder, respectively.


Fig. 4. DLMS adaptive filter with the inputs delayed by D_H holdup delays.

Fig. 5. Use of associativity of addition.


Fig. 6. Further use of associativity of addition.

Thus, unlike the existing architectures, we can break the two paths mentioned above without using the holdup delays. The holdup delays (D_H) shown in Fig. 5 are needed only for pipelining the filtering block. This implies that D_H is a small constant independent of the filter order. Had we broken the output-accumulation path and the error-broadcast path after every filter tap, the adaptation delay required for this purpose would equal N. However, by using associativity of addition in a limited manner to parallelize sets of (say) M serial additions in the output-accumulation path, as shown in Fig. 6, the number of adaptation delays required is reduced by a factor of M. The choice of M is such that the broadcast delay of M fan-out lines in the error-broadcast path is less than the critical path delay (T_clk) of the architecture.

We now exploit the slowness of the input sampling rate (f_s = 1/T_s) with respect to the hardware speed (1/T_clk) to reduce the adaptation delay by a further factor of P (= ⌊T_s/T_clk⌋). This is achieved by mapping P identical computations onto one physical hardware unit. Specifically, the P filtering arithmetic units (FAUs) and the P weight-update arithmetic units (WAUs) of a set (see Fig. 7) are mapped onto one physical FAU and WAU, respectively. This implies that the error-broadcast path and the output-accumulation path need to be broken only after every MP filter taps, as indicated in Fig. 7, thus requiring just N/MP adaptation delays. In order to achieve this, we use the slowdown and folding transformations. We first slow down the architecture by a factor of P, wherein each register in the retimed architecture shown in Fig. 7 is replaced by P registers and the system clock rate is set to P times the sampling rate (i.e., f_clk = P f_s). The resulting slowed-down architecture is (100/P)% hardware efficient.

For ease of exposition we assume N to be a multiple of MP. The extension of the architecture to the more general case of arbitrary N is considered in Appendix A.


Fig. 7. Retimed DLMS architecture.

Fig. 8. Circular systolic array for the DLMS algorithm.

To regain 100% hardware efficiency, we use the folding transformation. Using a locally-sequential-globally-parallel mapping [23], the P FAUs and WAUs of a set are mapped onto one physical FAU and WAU, respectively. Further, the FAUs and WAUs of a set are scheduled in the order of their index j, j = 0, ..., (P-1), in the P clock cycles of a sample period. This is the only permissible schedule due to the intra-iteration precedence constraint. The resulting folded architecture is a circular systolic array (see Fig. 8) consisting of a boundary processor module (BPM) and a series of L (= N/MP) identical folded processor modules (FPMs). The details of the BPM and an FPM are shown in Fig. 9. Note that when MP = 1 the resulting architecture is not systolic. In order to realize a systolic array, we assume that either M or P is greater than 1, which is a reasonable assumption even for high sampling rates. The control circuitry present in the FPM shown in Fig. 9 is obtained by the approach reported in [19]. The control circuitry consists of (i) multiplexers and (ii) folded arcs with synchronization registers.


Fig. 9. The BPM and the FPM of the folded DLMS architecture.

(i) Multiplexers. The P-to-1 and 2-to-1 multiplexers present in the FPM (see Fig. 9) are P-periodic in nature. A P-periodic n-to-1 multiplexer is defined as follows:


Definition 1. A P-periodic n-to-1 multiplexer is a system with n inputs, x_1, x_2, ..., x_n, and an output y, with the following input/output relationship: y(k) = x_{s(k)}(k), ∀k ∈ Z, where s(k) is defined as follows:

s(i) ∈ {1, ..., n},   ∀i = 0, ..., (P-1),   (4)

and

s(k) = s(k mod P),   ∀k ∈ Z.   (5)

Eq. (4) is referred to as the multiplexer definition.
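In code, Definition 1 amounts to indexing the select table modulo P; a toy model (ours, with hypothetical names):

    def periodic_mux(xs, s, k):
        # xs: tuple of n input signals, each a function of time k;
        # s:  length-P multiplexer definition with entries in {1, ..., n} (Eq. (4));
        # returns y(k) = x_{s(k mod P)}(k), realizing the periodicity of Eq. (5).
        P = len(s)
        return xs[s[k % P] - 1](k)

    # a 4-periodic 2-to-1 multiplexer that selects input 1 only in slot 0:
    s = [1, 2, 2, 2]
    x1, x2 = (lambda k: ('in1', k)), (lambda k: ('in2', k))
    print([periodic_mux((x1, x2), s, k) for k in range(6)])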

The definitions of the various multiplexers in the FPM are indicated in Fig. 9.

(ii) Folded arcs with synchronization registers. Suppose that the computations U and V are executed in the hardware units H_U and H_V of the folded architecture, respectively, and are scheduled in the u-th and v-th clock cycles of a sample period. Then an arc U→V with i registers gets mapped into an arc H_U→H_V with (iP + v - u) synchronization registers in the folded architecture [19]. For example, the arc A identified in the retimed architecture (see Fig. 7) with a single register gets mapped into arc A in the folded architecture (see the FPM in Fig. 9) with a single register (since iP + v - u = 1×P + 0 - (P-1) = 1). Similarly, arc B and arc C, shown in the retimed architecture with no registers, get mapped into a single arc in the folded architecture with a single register (since iP + v - u = 0×P + 1 - 0 = 1).
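The folded-arc register count can be checked mechanically; a small helper (ours) reproduces the two examples just given:

    def folded_registers(i, P, u, v):
        # An arc U -> V with i registers, with U scheduled in cycle u and V in
        # cycle v of a P-cycle sample period, folds to iP + v - u registers [19].
        return i * P + v - u

    print(folded_registers(1, 8, 7, 0))   # arc A: 1*P + 0 - (P-1) = 1
    print(folded_registers(0, 8, 0, 1))   # arcs B and C merged: 0*P + 1 - 0 = 1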


Fig. 10. Use of a special transformation.

The identical complex control structures present at the inputs to the FAU and the WAU of an FPM (see Fig. 9) can now be replaced by a simple control structure, as shown in Fig. 10. The correctness of this special transformation is established below.

3.1. Control circuitry minimization using a special transformation

Consider the complex control structure and the simple control structure depicted as systems Ŝ and S̃, respectively, in Fig. 11a and b. These systems belong to the class of periodic synchronous systems [19], and they embody their periodicity through the use of periodic multiplexers. In such systems there often occur signals which change once every P clock cycles; these are referred to as P-slow signals. The clock cycle in which a P-slow signal changes is referred to as the transition-point of the signal. Further, the P clock cycles between any two adjacent transition-points are addressed as slot 0, slot 1, ..., slot (P-1), in that order. Note that for a P-slow signal x(k) whose transition-points are integer multiples of P, we have x(k) = x(⌊k/P⌋P), ∀k ∈ Z. In systems that have P-slow signals and P-periodic multiplexers, the index i in the multiplexer definition (see Eq. (4)) is usually the slot number with respect to a reference P-slow signal.

The system Ŝ shown in Fig. 11a consists of a delay-line with (PM-1)P registers and M P-periodic P-to-1 multiplexers. The inputs to the multiplexers are tapped from the delay-line at regular intervals of MP registers. Multiplexer mux j selects the tap-point (((P-1-i)M + (j-1))P) during slot i of the P-slow input x̂. The system S̃ shown in Fig. 11b consists of a P-periodic 2-to-1 multiplexer in a loop along with (PM-1) registers. This multiplexer selects input 1 during slot 0 of the P-slow input x̃ and input 2 in the remaining (P-1) slots. The outputs ỹ_j, j = 1, ..., M, are separated from the multiplexer by (jP-1) registers. The following theorem describes the functional relationship between the two systems Ŝ and S̃ under identical input conditions:

Theorem 1. If the inputs x̂(n) and x̃(n) to the systems Ŝ and S̃, respectively, are identical P-slow signals with transition-points being integer multiples of P (without loss of generality), then the outputs of these systems are related as follows:

ŷ_j(n) = ỹ_j(n), ∀n ∈ Z, j = 1, ..., M,   (6)

ẑ(n) = z̃(n), ∀n mod P = 0 (i.e., slot 0).   (7)

A formal proof of this theorem is given in Appendix B. The proof can be intuitively understood from the basic fact that any data that enters the system S̃ gets recirculated P times before it leaves the system.


Fig. 11. The systems Ŝ and S̃ of Theorem 1.

Note that while the complex control structure (namely, the system Ŝ) uses M P-periodic P-to-1 multiplexers, the simple control structure (namely, the system S̃) uses just a single P-periodic 2-to-1 multiplexer. This not only reduces the hardware requirements but, more importantly in the context of systolic systems, keeps the delay predictable and independent of P. Further, there is a factor of P reduction in the number of registers used by the simple control structure as compared with the complex control structure.

3.2. Pipelining of periodic synchronous systems

We now address the issue of pipelining the various arithmetic units to achieve the desired adder-level pipelined architecture. Towards this end, we require the use of the retiming transformation. The existing retiming transformation [18] is applicable only to time-invariant synchronous systems. Since the retiming cutsets C_1 and C_2 indicated in Fig. 12 contain time-varying periodic multiplexers, which belong to the class of periodic synchronous systems, we extend the applicability of the retiming transformation to periodic synchronous systems through the following generalized retiming theorem.

Theorem 2. Let Ŝ be a P-periodic synchronous system and S̃ a system derived from Ŝ by removing j ∈ Z registers from each of the inputs of any cutset and adding j registers to each of the outputs of the cutset. Ŝ and S̃ are equivalent if and only if, for every multiplexer in the cutset, the multiplexer definitions are related by s̃(i) = ŝ((i+j) mod P), ∀i = 0, ..., (P-1).
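The multiplexer condition of Theorem 2 is easy to sanity-check numerically. The sketch below (ours) retimes j registers across a single P-periodic 2-to-1 multiplexer and verifies that the input/output behaviour is unchanged:

    P, j = 4, 2
    s_hat = [1, 2, 2, 1]                               # original multiplexer definition
    s_til = [s_hat[(i + j) % P] for i in range(P)]     # retimed definition (Theorem 2)
    x = (lambda k: 100 + k, lambda k: 200 + k)         # two arbitrary input signals
    y_hat = lambda k: x[s_hat[k % P] - 1](k - j)       # j registers before the mux
    y_til = lambda k: x[s_til[(k - j) % P] - 1](k - j) # j registers after the mux
    assert all(y_hat(k) == y_til(k) for k in range(j, 40))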


Fig. 12. Retiming cutsets shown.

The proof of this theorem follows from a straightforward application of the results presented in [18,19,23]; a formal proof is given in Appendix C. Using this generalized retiming theorem, a suitable number of adaptation registers and holdup registers are removed from the inputs to the cutsets C_1 and C_2 (see Fig. 12) and placed at the outputs (see Fig. 13), where m and l are, respectively, the number of pipeline registers for a multiplier and for a multiplier followed by an adder tree of depth ⌈log_2 M⌉. The final pipelined BPM and FPM are shown in Fig. 13. Note that for the proposed architecture T_clk ≈ T_A, since the 2-to-1 multiplexer delay is negligible compared to an adder delay. The various P-periodic 2-to-1 multiplexers present in the FPM select input 1 during one slot, as given in Table 1, and input 2 in the remaining (P-1) slots. The expressions for the adaptation delay (D_A) and the holdup delay (D_H) are given by

D_A = N/(MP) + ⌈(2m + l + 2)/P⌉ = L + ⌈(2m + l + 2)/P⌉,   (8)

D_H = ⌈l/P⌉.   (9)
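Eqs. (8) and (9) are straightforward to evaluate for a candidate design point; a helper of ours:

    from math import ceil

    def adaptation_delays(N, M, P, m, l):
        # D_A and D_H per Eqs. (8) and (9); assumes N is a multiple of MP,
        # so that there are L = N/(MP) folded processor modules.
        L = N // (M * P)
        return L + ceil((2 * m + l + 2) / P), ceil(l / P)

    # parameters of Example 2(a) below: N=128, M=4, P=8, m=3, l=5 -> D_A=6, D_H=1
    print(adaptation_delays(128, 4, 8, 3, 5))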


Fig. 13. The final pipelined BPM and FPM.

Table 1
Multiplexer definitions

Mux #   Selects input 1 in slot #
I       (-m - l - 1) mod P
II      (-m - l) mod P
III     (-l) mod P
IV      (P - 1)


Fig. 14. The pipelined BPM and FPM using carry-save arithmetic.

3.3. Fine-grain pipelined architecture using carry-save arithmetic

The critical paths of the proposed pipelined architecture that restrict the clock speed are the coefficient-update loop and the output-accumulation loop in the FPM (see Fig. 13). These updates and accumulations can be done in carry-save form [24]. In the case of the coefficient-update loop, the vector merging [24] can be done immediately outside the loop, while in the case of the output-accumulation loop, the entire accumulation can be kept in carry-save form with the vector merging done in the BPM (see Fig. 14). Further, these vector-merge adders are pipelined by (say) q registers. Note that the subtractor in the BPM will also have to be pipelined.
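The carry-save idea is easy to model at word level: each accumulation step applies the bitwise full-adder equations, producing separate sum and carry words, and only the final vector merge performs a carry-propagating addition. A toy Python model (ours), for nonnegative operands:

    def csa(a, b, c):
        # one carry-save stage: bitwise full-adder equations on whole words
        return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

    def cs_accumulate(values):
        s, c = 0, 0
        for v in values:            # the loop is free of carry propagation
            s, c = csa(s, c, v)
        return s + c                # single vector-merge addition at the end

    assert cs_accumulate([3, 5, 7, 11, 13]) == sum([3, 5, 7, 11, 13])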


For ease of exposition, we assume that the subtractor is also pipelined by q registers. The expressions for D_A and D_H are now given by

D_A = N/(MP) + ⌈(2m + l + 3q)/P⌉ = L + ⌈(2m + l + 3q)/P⌉,   (10)

D_H = ⌈(l + q)/P⌉.   (11)
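The corresponding variant of the earlier delay helper (again ours) follows Eqs. (10) and (11):

    from math import ceil

    def adaptation_delays_csa(N, M, P, m, l, q):
        # Eqs. (10) and (11): carry-save datapath, q-stage vector-merge pipelining
        L = N // (M * P)
        return L + ceil((2 * m + l + 3 * q) / P), ceil((l + q) / P)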

The critical path of this fine-grain pipelined architecture consists of a full adder (carry-save adder) and a 2-to-1 multiplexer. In present-day technology this delay is of the order of a nanosecond, and hence the clock speed is in practice limited only by the ability to generate and distribute a very fast clock. Thus reasonable values of P, and hence a significant reduction in D_A, can be achieved even for high sampling rates, say of the order of a hundred megasamples per second. The fine-grain pipelined architecture with minimal adaptation delay can be used either for high sampling rate or for low-power applications. It therefore appears that the use of sum relaxation [7] or vector processing [8] techniques, with their attendant area penalty and convergence degradation, is not necessary for high sampling rate or low-power applications.

3.4. Performance evaluation

We now evaluate the performance of the proposed pipelined architecture with respect to the existing pipelined architectures for two different sampling rates, namely T_s = 2(T_M + T_A) [6] and T_s = T_M [7], through the following examples. For uniformity in comparison with the existing architectures, we choose the pipelined modules with standard arithmetic units as depicted in Fig. 13.

Example 2(a). Working with the system identification configuration described in Example 1(a), we evaluate the performance of the pipelined architecture reported in [6] (which is designed to meet a specific sample period of 2(T_M + T_A)) against the proposed pipelined architecture for an identical sample period.

Let T_M = 3T_A and M = 4. Hence m = 3, l = 5 and P = 8. Table 2 highlights the significant improvement of this work with respect to [6] for various metrics. There is a factor of P savings in the number of arithmetic units, namely multipliers and adders.

Table 2
Comparison table for Examples 2(a) and 2(b)

                      Example 2(a)          Example 2(b)
Metric                This work   [6]       This work   [6]
D_A                   6           128       4           64
D_H                   1           128       1           64
# of Multipliers      33          257       17          129
# of Adders           33          257       17          129
# of Registers        557         1147      295         571
# of 2-to-1 Muxes     16          0         8           0


Fig. 15. Convergence plots for Example 2(a).

Further, there is a significant savings in the number of registers. All these hardware savings are achieved at the minimal cost of a few 2-to-1 multiplexers. Simulation results for the two architectures are shown in Fig. 15. As in Example 1(a), in this and all further system identification examples, for each plot the step size was chosen to get the fastest convergence while ensuring that the misadjustment was below 100%. The simulation was done using SIMULINK [25], which provides cycle-true simulation. The architecture is thus verified at the RTL level and hence can be easily synthesized onto silicon. The ideal LMS convergence plot (i.e., D_A = D_H = 0) is shown for reference. Note the significant improvement in convergence speed for the proposed architecture. Note also the shift in the convergence plot of [6] due to the large holdup delay, which is a function of the filter order.

Example 2(b). Similar to Example 2(a), the performance metrics for the ALE configuration depicted in Example 1(b) are evaluated. Table 2 highlights the significant improvement of this work with respect to [6]. The corresponding simulation results for the two architectures are shown in Fig. 16. As in Example 1(b), in this and all further ALE examples, for each plot the step size was chosen to get the fastest convergence while ensuring that the misadjustment was below 30%. Note the significant improvement in convergence speed for the proposed architecture, and again the shift in the convergence plot of [6] due to the large holdup delay.

Example 3(a). For the same system identification configuration, we now compare the performance of a special case of the architecture reported in [7] (which is designed to meet a sample period of T_s = T_M) against the proposed pipelined architecture for an identical sample period.

With D_1 = 1 and LA = 1, the PIPLMS architecture of [7] reduces to a DLMS architecture.


Fig. 16. Convergence plots for Example 2(b).

Table 3
Comparison table for Examples 3(a) and 3(b)

                      Example 3(a)          Example 3(b)
Metric                This work   [7]       This work   [7]
D_A                   45          138       29          74
D_H                   5           131       5           67
# of Multipliers      257         257       129         129
# of Adders           257         257       129         129
# of Registers        1111        1929      677         969

Note that the values of m, l and M are the same as in Example 2(a), while P = 1. Table 3 highlights the improvement of this work with respect to [7] for various metrics. From Table 3, we see the significant reduction in the adaptation delay and the holdup delay. This directly contributes to improved algorithmic performance, as depicted in the simulation results shown in Fig. 17. Further, there is a significant reduction in the number of registers, while the number of arithmetic units is the same for both architectures.

Example 3(b). Similar to Example 3(a), the performance metrics for the ALE configuration are evaluated. Table 3 highlights the improvement of this work with respect to [7]. Significant reductions in the adaptation and holdup delays directly contribute to the improved algorithmic performance of the proposed architecture, as shown in Fig. 18. Further, there is a significant reduction in the number of registers, while the number of arithmetic units is the same for both architectures.


Fig. 17. Convergence plots for Example 3(a).

Fig. 18. Convergence plots for Example 3(b).

Thus far, we have presented a methodology to synthesize pipelined architectures for DLMS adaptive filtering with minimum D_A. It was beneficial to minimize D_A, since it directly contributes to improved convergence speed. However, we show in the following section that D_A is a monotonically decreasing function of the supply voltage V.


This brings about a novel tradeoff between algorithmic performance and power dissipation.

4. Low-power architectures

We first establish the relationship between the adaptation delay D_A and the supply voltage V. Towards this end, we require the following expression for the critical path delay (T_clk) in CMOS technology [4]:

T_clk(V) = CV / [k(V - V_T)^2],   (12)

where C is the capacitive load of the critical path, k and V_T are device-level model parameters, and V is the supply voltage. As we have seen, minimizing the adaptation delay directly contributes to improved algorithmic performance. However, for a given N and T_s, decreasing D_A:

⇒ increasing P (from Eq. (8); m, l and M are constants for a given critical path)
⇒ decreasing T_clk (since P = ⌊T_s/T_clk⌋)
⇒ increasing V (from Eq. (12)).

Thus, D_A is a monotonically decreasing function of the supply voltage V. In other words, the minimum adaptation delay solution needs the maximum supply voltage (say V_max). Further, the critical path delay corresponding to the supply voltage V_max is given by

T_clk(V_max) = CV_max / [k(V_max - V_T)^2].   (13)

We now derive the expression for power dissipation as a function of the supply voltage and the algorithm specifications. In CMOS technology [4],

Power dissipation = C_eff V^2 f_clk
    (where C_eff is the effective switched capacitance of the realization and f_clk (= 1/T_clk) is the clock frequency)
  ≈ (2LM)(C_M + C_A) V^2 (P f_s)
    (where C_M and C_A are the effective switched capacitances of a multiplier and an adder, respectively; for ease of exposition, we neglect the power dissipated by the control circuitry and the boundary processor module)
  = 2N(C_M + C_A) V^2 f_s.   (14)

In Eq. (14), for the given algorithm specifications and technology, the only free variable is the supply voltage V. We now synthesize low-power architectures by treating the supply voltage V as a free variable and using parallelism to meet the required computational throughput [4]. Towards synthesizing the architecture under a power constraint, let us define the power reduction factor β as

β = (Desired power dissipation)/(Peak power dissipation) = V^2/V_max^2   (from Eq. (14)).   (15)

Note that β equals 1 for the minimum adaptation delay solution. The supply voltage V for the architecture which meets the given β (< 1) can be calculated from Eq. (15). Further, the critical path delay at this supply voltage V is given by

T_clk(V) = T_clk(V_max) × (V/V_max) × [(V_max - V_T)/(V - V_T)]^2.   (16)

Hence, from P = ⌊T_s/T_clk(V)⌋, the required value of P can be obtained. Further, the number of required FPMs can be obtained from L = N/(MP). Since we use parallelism to reduce power, for any given power reduction factor β we need more FPMs than the minimum-D_A solution. Further, note that for any β < 1 the adaptation delay increases because P decreases. This brings about the interesting tradeoff between algorithmic performance and power dissipation, which is illustrated through the following examples.
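Since the technology constants C and k cancel out of Eqs. (15) and (16), the whole design procedure collapses to a few lines; the sketch below (ours) reproduces the two design points of Example 4(a) below:

    from math import ceil, floor

    def low_power_design(N, M, m, l, Ts, Tclk_max, Vmax, VT, beta):
        V = Vmax * beta ** 0.5                                        # Eq. (15)
        Tclk = Tclk_max * (V / Vmax) * ((Vmax - VT) / (V - VT)) ** 2  # Eq. (16)
        P = floor(Ts / Tclk)                                          # slowdown factor
        L = N // (M * P)                                              # number of FPMs
        return round(V, 2), P, L, L + ceil((2 * m + l + 2) / P)       # V, P, L, D_A

    print(low_power_design(128, 4, 3, 5, 80.0, 10.0, 3.3, 0.6, 1.0))   # (3.3, 8, 4, 6)
    print(low_power_design(128, 4, 3, 5, 80.0, 10.0, 3.3, 0.6, 0.25))  # (1.65, 2, 16, 23)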

Example 4(a). For the system identification configuration, we assume the input sample period to be 80 ns. The critical path delay at the maximum supply voltage (V_max) of 3.3 V and threshold voltage (V_T) of 0.6 V is assumed to be 10 ns. As before, T_M = 3T_A, M = 4, m = 3 and l = 5. Table 4 highlights the two cases of this work, namely the synthesis solution without a power constraint (β = 1) and the synthesis solution with a power constraint β = 0.25, for various metrics. Note that the synthesis solution with a power constraint uses extra hardware in order to achieve the power reduction. Further, its adaptation delay is larger than that of the minimum adaptation delay solution (the synthesis solution without a power constraint). Fig. 19 shows the simulation results with D_A = 6 and D_H = 1 (minimum-D_A solution, or fast convergence solution) and with D_A = 23 and D_H = 3 (synthesis solution with power constraint, or low-power solution). It is clear from the figure that algorithmic performance (namely convergence speed) is traded for power. The same plot also shows the convergence performance of the architecture reported in [6]. The convergence speed of this architecture, operating at the maximum supply voltage V_max and hence dissipating the peak power of 2N(C_M + C_A)V_max^2 f_s, is significantly worse than the low-power solution of this work.

Table 4 further highlights the significant improvement of this work compared with [6] for various metrics.

Table 4
Comparison table for Examples 4(a) and 4(b)

                      Example 4(a)                       Example 4(b)
Metric                This work          Previous        This work          Previous
                      β=1      β=0.25    work [6]        β=1      β=0.25    work [6]
V                     3.3 V    1.65 V    3.3 V           3.3 V    1.65 V    3.3 V
P                     8        2         1               8        2         1
D_A                   6        23        128             4        15        64
D_H                   1        3         128             1        3         64
# of Multipliers      33       129       257             17       65        129
# of Adders           33       129       257             17       65        129
# of Registers        557      873       1147            295      449       571
# of 2-to-1 Muxes     16       64        0               8        32        0


Fig. 19. Convergence plots for Example 4(a).

It is interesting to note the impact of the following delay model, described in [26] for sub-micron CMOS technology, on the trade-off between algorithmic performance and power dissipation:

T_clk(V) = CV / [k(V - V_T)^α],   (17)

where α ranges between 1 and 2. For example, in a 0.6 μm CMOS technology [27] wherein α = 1.6, D_A and D_H for the low-power solution are 16 and 2, respectively. This implies that the degradation in the convergence speed for the low-power solution (β = 0.25) with respect to the fast convergence solution (β = 1) is smaller for α = 1.6 than for α = 2.
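Under the α-power delay model, the same calculation generalizes directly; with the stated 0.6 μm parameters, this sketch (ours) reproduces D_A = 16 and D_H = 2 for β = 0.25:

    from math import ceil, floor

    def low_power_delays_alpha(N, M, m, l, Ts, Tclk_max, Vmax, VT, beta, alpha):
        V = Vmax * beta ** 0.5                                            # Eq. (15)
        Tclk = Tclk_max * (V / Vmax) * ((Vmax - VT) / (V - VT)) ** alpha  # Eq. (17) form
        P = floor(Ts / Tclk)
        return ceil(N / (M * P)) + ceil((2 * m + l + 2) / P), ceil(l / P)

    print(low_power_delays_alpha(128, 4, 3, 5, 80.0, 10.0, 3.3, 0.6, 0.25, 1.6))  # (16, 2)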

Example 4(b). Similar to Example 4(a), the performance metrics for the ALE configuration are evaluated. Table 4 highlights the two cases of this work, namely the fast convergence solution and the low-power solution, and compares them with [6] for various metrics. Fig. 20 shows the simulation results for the two cases. It is clear from the figure that algorithmic performance (namely convergence speed) is traded for power. Also shown in the same plot is the convergence performance of the architecture reported in [6]. The convergence speed of this architecture, operating at the maximum supply voltage V_max and hence dissipating the peak power, is significantly worse than the low-power solution of this work.


Fig. 20. Convergence plots for Example 4(b).

Note that, since Eqs. (15) and (16) are simple high-level equations, the precise power reduction in an actual implementation may be somewhat different from β. Nevertheless, it is significant that such a tradeoff can be made at the highest level of abstraction, viz., the algorithmic level.

5. Conclusions

We have presented a methodology to synthesize pipelined architectures for DLMS adaptive filtering with minimal adaptation delay. The systolic pipelined architecture was shown to have superior convergence speed compared with existing pipelined architectures. The improvement in convergence speed was achieved by applying the right sequence of function preserving transformations to the signal flow graph representation of the DLMS algorithm to derive the pipelined architecture. We then extended the synthesis methodology to synthesize pipelined architectures with minimal adaptation delay subject to a power constraint. The resulting low-power pipelined DLMS adaptive filter architecture exhibits a novel tradeoff between algorithmic performance and power dissipation. This architecture exploits the parallelism in the DLMS algorithm to meet the required computational throughput. It was shown that this use of parallelism to reduce power results in an increase in the adaptation delay, thereby creating a tradeoff between algorithmic convergence speed and the power reduction factor. This tradeoff was illustrated with system identification and adaptive line enhancer examples.


Appendix A. Architectures for arbitrary filter order N

For an arbitrary filter order N, the final pipelined architecture implements the following equations:

y(n) = \sum_{k=0}^{Q-1} a_k(n) x(n - k - D_H),  where a_k(n) = 0 ∀ N ≤ k < Q and Q = ⌈N/P⌉P,   (A.1)

a_k(n+1) = a_k(n) + \mu e(n - D_A) x(n - k - D_A - D_H), ∀k,   (A.2)

e(n) = d(n - D_H) - y(n).   (A.3)

The implementation for an arbitrary N minimizes the number of null operations, which is strictly less than P. When N is not a multiple of MP, the rightmost FPM of this architecture differs from the remaining FPMs, as shown in Fig. 21. Specifically, it has P-periodic 2-to-1 multiplexers at the coefficient inputs to its filtering block in order to introduce the null operations. Further, the number of fan-out lines in the error-broadcast path of this FPM is M̂ = ⌈N/P⌉ - ⌊N/(MP)⌋M (see Fig. 21), which is less than or equal to M. The rightmost FPM in Fig. 21 is shown in its most general form, wherein the P-periodic 2-to-1 multiplexers are present at each of the coefficient inputs to its filtering block. However, depending on the relative values of M̂ and P, not all the inputs need to have 2-to-1 multiplexers. The altered expression for D_A is given by

D_A = ⌈N/(MP)⌉ + ⌈(2m + l + 2)/P⌉.   (A.4)
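The quantities of this appendix are simple to compute; a sketch of ours:

    from math import ceil, floor

    def arbitrary_N_params(N, M, P):
        Q = ceil(N / P) * P                           # padded order; Q - N < P nulls
        M_hat = ceil(N / P) - floor(N / (M * P)) * M  # fan-out of the rightmost FPM
        return Q, M_hat

    def adaptation_delay_arbitrary(N, M, P, m, l):
        return ceil(N / (M * P)) + ceil((2 * m + l + 2) / P)  # Eq. (A.4)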

However, the expression for D_H remains unaltered.

Appendix B. Proof of Theorem 1

The proof of Theorem 1 requires the lemma described below. Consider the system ŝ shown in Fig. 22a, which consists of a delay-line with (P-1)MP registers and a P-periodic P-to-1 multiplexer (mux 1). The inputs to mux 1 are tapped from the delay-line at regular intervals of MP registers. As shown in the figure, mux 1 selects the tap-point ((P-1-i)MP) during slot i of the P-slow input x̂. The system s̃ shown in Fig. 22b consists of a P-periodic 2-to-1 multiplexer, mux 2, in a loop along with (PM-1) registers. The multiplexer mux 2 selects input 1 during slot 0 of the P-slow input x̃ and input 2 in the remaining (P-1) slots. The output ỹ is separated from mux 2 by (P-1) registers. The following lemma describes the functional relationship between the two systems ŝ and s̃ under identical input conditions:


Fig. 21. The rightmost FPM for an arbitrary filter-order N.

Lemma B.1. Let the inputs to the systems ŝ and s̃, namely x̂(n) and x̃(n), respectively, be identical P-slow signals whose transition-points are (without loss of generality) integer multiples of P. Then the outputs of these systems are related as follows:

ŷ(n) = ỹ(n), ∀n ∈ Z,   (B.1)

ẑ(n) = ỹ(n), ∀n mod P = 0 (i.e., slot 0).   (B.2)

Fig. 22. The systems ŝ and s̃ of Lemma B.1.

Proof. The input/output relationship of ŝ is described by

ŷ(n) = x̂(n - (P-1)MP),   ∀n mod P = 0 (slot 0),
ŷ(n) = x̂(n - (P-2)MP),   ∀n mod P = 1 (slot 1),
  ...
ŷ(n) = x̂(n),             ∀n mod P = (P-1) (slot (P-1)),

i.e., ŷ(n) = x̂(n - (P-1-(n mod P))MP), ∀n ∈ Z, and ẑ(n) = x̂(n - (P-1)MP), ∀n ∈ Z.


The input/output relationship of s̃ is described by

ỹ(n) = x̃(⌊n/P⌋P),             ∀n mod P = (P-1),
ỹ(n) = x̃(⌊n/P⌋P - MP),        ∀n mod P = (P-2),
  ...
ỹ(n) = x̃(⌊n/P⌋P - (P-1)MP),   ∀n mod P = 0,

i.e., ỹ(n) = x̃(⌊n/P⌋P - (P-1-(n mod P))MP), ∀n ∈ Z. Hence,

ỹ(n) = x̂(⌊n/P⌋P - (P-1-(n mod P))MP), ∀n ∈ Z   (since x̃(n) = x̂(n), ∀n ∈ Z)
     = x̂(⌊(n - (P-1-(n mod P))MP)/P⌋P), ∀n ∈ Z   (since ⌊n/P⌋P - kP = ⌊(n-kP)/P⌋P, ∀n, k ∈ Z)
     = x̂(n - (P-1-(n mod P))MP), ∀n ∈ Z   (since x̂(n) = x̂(⌊n/P⌋P), ∀n ∈ Z)
     = ŷ(n), ∀n ∈ Z,

and

ỹ(n) = ẑ(n), ∀n mod P = 0. □

Proof. Now consider the portion of Ŝ shown in Fig. 23a. By Lemma B.1, the functional relationship between this structure and the structure shown in Fig. 23b is given by

ŷ_j(n) = ỹ_j(n), ∀n ∈ Z, j = 1, ..., M,


Fig. 23. Proof of Theorem 1.

and

ẑ(n) = z̃(n), ∀n mod P = 0 (i.e., slot 0).

Further, due to Theorem 2, the structures shown in Fig. 23b and c are equivalent. Note that the structure shown in Fig. 23c is a portion of S̃. Hence the proof is complete. □

Appendix C. Proof of Theorem 2

The proof of Theorem 2 requires the following lemmas.

Lemma C.1. Let Ŝ be a time-invariant synchronous system and S̃ a system derived from Ŝ by moving j ∈ Z delay elements from each of the inputs of any cutset of Ŝ to each of its outputs. The two systems Ŝ and S̃ are equivalent.


The above lemma is the classic retiming theorem of [18], stated in the language of [23]. The following lemma [19] describes retiming of a P-periodic multiplexer.

Lemma C.2. Let Ŝ be a system with j ∈ Z delay elements present at all the inputs to a P-periodic n-to-1 multiplexer defined by ŝ(i), and S̃ a system with j delay elements at the output of a P-periodic n-to-1 multiplexer defined by s̃(i). Ŝ and S̃ are equivalent if and only if s̃(i) = ŝ((i+j) mod P), ∀i = 0, ..., (P-1).

Proof. The proof of the theorem follows from the series of transformations shown in Fig. 24. The basic idea is that any cutset in a periodic synchronous system can be viewed as an interconnection of time-varying and time-invariant subsystems (see Fig. 24b). The time-varying subsystem consists of the periodic multiplexers, while the time-invariant subsystem consists of the remaining circuitry. As shown in Fig. 24c, delays and anti-delays are created at the output of the time-varying subsystem. Lemma C.1 is then applied, resulting in the system shown in Fig. 24d. Next, Lemma C.2 is applied to this system, resulting in the retimed system shown in Fig. 24e. Note that, by Lemma C.2, the definitions of all the multiplexers in the time-varying subsystem are related by s̃(i) = ŝ((i+j) mod P), ∀i = 0, ..., (P-1). Hence the proof is complete. □

Fig. 24. Proof of Theorem 2.


References

[1] B. Widrow, S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985.
[2] B. Hassibi, A.H. Sayed, T. Kailath, H-infinity optimality of the LMS algorithm, IEEE Trans. Signal Process. 44 (2) (1996) 267-279.
[3] G. Long, F. Ling, J.G. Proakis, The LMS algorithm with delayed coefficient adaptation, IEEE Trans. Acoust., Speech, Signal Process. 37 (9) (1989) 1397-1405; and corrections, 40 (1) (1992) 230-232.
[4] A.P. Chandrakasan, S. Sheng, R.W. Brodersen, Low-power CMOS digital design, IEEE J. Solid-State Circuits 27 (1992) 473-484.
[5] H. Herzberg, R. Haimi-Cohen, Y. Be'ery, A systolic array realization of an LMS adaptive filter and effects of delayed adaptation, IEEE Trans. Signal Process. 40 (11) (1992) 2799-2802.
[6] M.D. Meyer, D.P. Agrawal, A high sampling rate delayed LMS filter architecture, IEEE Trans. Circuits Syst. II: Analog Digital Signal Process. 40 (11) (1993) 727-729.
[7] N.R. Shanbhag, K.K. Parhi, Relaxed look-ahead pipelined LMS adaptive filters and their application to ADPCM coder, IEEE Trans. Circuits Syst. II: Analog Digital Signal Process. 40 (12) (1993) 753-766.
[8] M.D. Meyer, D.P. Agrawal, Vectorization of the DLMS transversal adaptive filter, IEEE Trans. Signal Process. 42 (11) (1994) 3237-3240.
[9] V. Visvanathan, S. Ramanathan, A modular systolic architecture for delayed least mean squares adaptive filtering, Proc. 8th Internat. Conf. VLSI Design, Delhi, India, 1995, pp. 332-337.
[10] J. Thomas, Pipelined systolic architectures for DLMS adaptive filtering, J. VLSI Signal Process. 12 (6) (1996) 223-246.
[11] C.-L. Wang, Bit-serial VLSI implementation of delayed LMS adaptive filter, IEEE Trans. Signal Process. 42 (8) (1994) 2169-2175.
[12] C.-L. Wang, C.-C. Chen, C.-F. Chang, A digit-serial VLSI architecture for delayed LMS adaptive FIR filtering, Proc. ISCAS, 1995, pp. 545-548.
[13] R.D. Poltmann, Conversion of the delayed LMS algorithm into the LMS algorithm, IEEE Signal Process. Lett. 2 (12) (1995) 223.
[14] Q. Zhu, S.C. Douglas, K.F. Smith, A pipelined architecture for LMS adaptive FIR filters without adaptation delay, Proc. ICASSP, 1997, pp. 1933-1936.
[15] K. Matsubara, K. Nishikawa, H. Kiya, A new pipelined architecture of the LMS algorithm without degradation of convergence characteristics, Proc. ICASSP, 1997, pp. 4125-4128.
[16] K. Matsubara, K. Nishikawa, H. Kiya, Pipelined LMS adaptive filter with a new look-ahead transformation, Proc. ISCAS, 1997, pp. 2309-2312.
[17] S. Ramanathan, V. Visvanathan, A systolic architecture for LMS adaptive filtering with minimal adaptation delay, Proc. 9th Internat. Conf. VLSI Design, Bangalore, January 1996, IEEE CS Press, Silver Spring, MD, pp. 286-289.
[18] C.E. Leiserson, J.B. Saxe, Optimizing synchronous systems, J. VLSI Comput. Syst. 1 (1) (1983) 41-67.
[19] K.K. Parhi, C.-Y. Wang, A.P. Brown, Synthesis of control circuits in folded pipelined DSP architectures, IEEE J. Solid-State Circuits 27 (1) (1992) 29-43.
[20] K.K. Parhi, High-level algorithm and architecture transformations for DSP synthesis, J. VLSI Signal Process. 9 (1) (1995) 121-143.
[21] A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, R.W. Brodersen, Optimizing power using transformations, IEEE Trans. Computer-Aided Design Integrated Circuits Syst. 14 (1) (1995) 12-31.
[22] S. Ramanathan, V. Visvanathan, Low-power configurable processor array for DLMS adaptive filtering, Proc. 10th Internat. Conf. VLSI Design, Hyderabad, January 1997, IEEE CS Press, Silver Spring, MD.
[23] S.Y. Kung, VLSI Array Processors, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[24] R. Jain, P.T. Yang, T. Yoshino, FIRGEN: A computer-aided design system for high performance FIR filter integrated circuits, IEEE Trans. Signal Process. 39 (7) (1991) 1655-1668.
[25] SIMULINK User's Guide, The MathWorks Inc., December 1993.
[26] T. Sakurai, A.R. Newton, Delay analysis of series-connected MOSFET circuits, IEEE J. Solid-State Circuits 26 (1991) 122-131.
[27] U. Ko, P.T. Balsara, W. Lee, Low-power design techniques for high-performance CMOS adders, IEEE Trans. Very Large Scale Integration (VLSI) Syst. 3 (2) (1995) 327-333.

S. Ramanathan received his B.Sc. degree in Physics from Madras University in 1988. He completed his M.E. in the Department of Electrical Communication Engineering at the Indian Institute of Science in 1992. He is currently pursuing his Ph.D. in the Supercomputer Education and Research Centre at the Indian Institute of Science. His research interests are in the areas of VLSI signal processing and low-power VLSI systems.

V. Visvanathan received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Delhi, India, the M.S.E.E. degree from the University of Notre Dame, Indiana, U.S.A., and the Ph.D. degree in Electrical Engineering and Computer Sciences from the University of California, Berkeley, U.S.A. He has taught at the University of California, Berkeley, and the University of Maryland, College Park, and has worked at the IBM T.J. Watson Research Center, Yorktown Heights, New York, and at Bell Labs, Murray Hill, New Jersey. He is currently an Associate Professor at the Indian Institute of Science, Bangalore, on sabbatical leave at the Bell Labs Design Automation Center, Lucent Technologies, Allentown, Pennsylvania. His research interests are in the area of design automation of mixed analog/digital VLSI systems for signal processing and telecommunications applications.