JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 41, 131–136 (1997)
ARTICLE NO. PC961291

Design and Evaluation of a Photonic FFT Processor

Richard G. Rozier, Fouad E. Kiamilev, and Ashok V. Krishnamoorthy*

University of North Carolina Charlotte, Charlotte, North Carolina 28223; and *Lucent Technologies, Bell Labs Innovations, Holmdel, New Jersey 07733

We describe the design of a high-performance chipset for computing 1-D complex fast Fourier transforms (FFTs). Our chip technology is based on flip-chip integration of submicrometer CMOS ICs with GaAs chips containing 2-D arrays of multiple-quantum-well (MQW) diode optical receivers and transmitters. Our system technology is based on free-space optical interconnection. Compared to an electronic implementation, the proposed design offers increased performance and a reduced chip count. Since our design uses direct, one-to-one interconnections between chips, it is compatible with a number of emerging three-dimensional free-space optoelectronic packaging technologies. © 1997 Academic Press

1. INTRODUCTION

Free-space optical interconnection (FSOI) of integrated circuits shows great potential for efficient implementation of high-performance parallel computing and signal processing systems [1, 2]. On the implementation front, several FSOI technologies are now under development and prototype FSOI systems have been demonstrated [3, 4]. However, with the exception of switching and backplane interconnect applications [5, 6], there has been relatively little published work that includes detailed integrated circuit (IC) design and accounts for the impact of IC layout constraints on FSOI system architecture, design, and performance. Such a detailed design study is the objective of this paper. We focus on the specific problem of designing a high-speed FSOI FFT processor. Our FSOI technology is based on the hybrid CMOS–SEED platform that integrates GaAs MQW photodetectors and modulators with high-volume commodity CMOS VLSI processes [7]. For CMOS technology we use the 0.5 µm HP/MOSIS CMOS14 process [8]. This is a high-speed, high-density 3.3 V/5 V CMOS process with a tight-pitch (1.4 µm) planarized interconnect system that allows three levels of metallized interconnect. Our optical interconnect technology is based on a free-space holographic interconnect technology that has previously been used in several FSOI demonstrators [9].

The result of this work is the design of a chipset for building FSOI FFT processors, consisting of the Photonic FFT (PFFT) chip and the Photonic RAM (PRAM) chip. The PRAM chip integrates over 2.5 million transistors in an estimated 1 × 1 cm die area. The PFFT chip integrates over 1.4 million transistors in an estimated 1.2 × 1.2 cm die area. Each chip has 768 optical inputs and 768 optical outputs, providing a chip-to-chip bandwidth of 29 Gbytes/s at a 150 MHz clock speed. A complete 8,192-point (or smaller) FFT processor can be built using just three chips. It is shown that the proposed design offers significant advantages over purely electronic implementations in terms of increased performance and reduced chip count.

Section 2 discusses the FFT algorithm and its partitioning for an FSOI implementation. In Section 3, we present the detailed IC design of the PFFT and PRAM chips and discuss the optical system. Performance evaluation and comparison with existing electronic implementations are provided in Section 4. Section 5 concludes with a discussion of our design.

2. FFT ALGORITHM MAPPING

2.1. The FFT Algorithm

The FFT algorithm is a series of multiplications and additions performed in stages. For an N-point FFT with radix-r butterflies, the transform is computed with N/r butterfly processors (BPs) per stage and log_r N stages; the radix determines how many inputs and outputs each BP has. Using radix-2 butterflies, there are N/2 BPs per stage and log₂ N stages, where each BP has 2 inputs and 2 outputs. The inputs of the FFT are data points sampled in the time domain; the outputs are discrete frequency-domain data points. The equations that result for a radix-2 BP are

$$A' = A + W_N^k B, \tag{2.1}$$
$$B' = A - W_N^k B, \tag{2.2}$$

where $A$ and $B$ are complex numbers. $W_N^k$ is called the "twiddle factor" and is expressed as

$$W_N^k = e^{-j 2\pi (k/N)}. \tag{2.3}$$
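For reference, the twiddle factors of Eq. (2.3) can be tabulated offline in the 16-bit fixed-point format the chipset uses for twiddle coefficients (Section 3.4). A minimal sketch in C follows; the Q1.15 scaling is an assumption, as the paper states only that twiddles have 16-bit real and imaginary parts.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Tabulate W_N^k = e^{-j 2*pi*k/N} for k = 0..N/2-1 as 16-bit
 * fixed-point pairs (assumed Q1.15 quantization). */
static void make_twiddles(int n, int16_t *w_re, int16_t *w_im)
{
    const double pi = acos(-1.0);
    const double scale = 32767.0;
    for (int k = 0; k < n / 2; k++) {
        double ang = -2.0 * pi * k / n;
        w_re[k] = (int16_t)lround(cos(ang) * scale);
        w_im[k] = (int16_t)lround(sin(ang) * scale);
    }
}

int main(void)
{
    int16_t re[4], im[4];
    make_twiddles(8, re, im);   /* twiddles for an 8-point FFT */
    for (int k = 0; k < 4; k++)
        printf("W_8^%d = (%d, %d)\n", k, re[k], im[k]);
    return 0;
}
```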

For actual hardware implementation, these equations should be broken down into real and imaginary components:

$$A'_R = A_R + B_R W_R - B_I W_I, \tag{2.4}$$
$$A'_I = A_I + B_R W_I + B_I W_R, \tag{2.5}$$
$$B'_R = A_R - (B_R W_R - B_I W_I), \tag{2.6}$$
$$B'_I = A_I - (B_R W_I + B_I W_R). \tag{2.7}$$
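To make the datapath concrete, here is a minimal fixed-point sketch of Eqs. (2.4)–(2.7) in C. The paper specifies 24-bit data and 16-bit twiddles but not the scaling or rounding, so the FRAC_BITS shift below is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { int32_t re, im; } cplx_t;

/* Assumed twiddle scaling: Q1.15 fractions (not specified in the paper). */
#define FRAC_BITS 15

/* One radix-2 butterfly, Eqs. (2.4)-(2.7): four multiplies and six
 * add/subtracts, with subtraction done in two's complement. */
static void butterfly(cplx_t *a, cplx_t *b, int32_t w_re, int32_t w_im)
{
    int32_t wb_re = (int32_t)((((int64_t)b->re * w_re) -
                               ((int64_t)b->im * w_im)) >> FRAC_BITS);
    int32_t wb_im = (int32_t)((((int64_t)b->re * w_im) +
                               ((int64_t)b->im * w_re)) >> FRAC_BITS);
    cplx_t a_in = *a;
    a->re = a_in.re + wb_re;   /* A'_R  (2.4) */
    a->im = a_in.im + wb_im;   /* A'_I  (2.5) */
    b->re = a_in.re - wb_re;   /* B'_R  (2.6) */
    b->im = a_in.im - wb_im;   /* B'_I  (2.7) */
}

int main(void)
{
    cplx_t a = { 1000, 0 }, b = { 1000, 0 };
    butterfly(&a, &b, 32767, 0);   /* W ~ 1: a' = a+b, b' = a-b */
    printf("a' = (%d, %d), b' = (%d, %d)\n", a.re, a.im, b.re, b.im);
    return 0;
}
```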



FIG. 1. Radix-2 butterfly block diagram.

The above equations can be implemented in a straightforward fashion using four multipliers and six adders, as shown in Fig. 1 (where subtraction is performed by two's-complement conversion using inverters and the carry-in bit of the adder). Our design is a semiparallel implementation that places eight radix-2 BPs on one chip, called the Photonic FFT (PFFT) chip, as shown in Fig. 2. It exploits the strengths of both electronics and optics: eight butterfly calculations are performed electronically per clock cycle, while data storage and communication are accomplished through optical interconnects.

FIG. 2. Eight radix-2 BPs on a chip (PFFT).

2.2. Butterfly Processor Operation

The inputs and outputs of the BPs are connected to optical receivers and transmitters that are described later in the paper, as is the twiddle factor generator. The processors operate in parallel and are loaded with data from a separate memory chip. The eight BPs always work on data from a single stage. The FFT procedure is as follows:

1. All of the data for one stage are entered into the eight BPs 16 points at a time, where each retrieval of 16 points is executed during one clock cycle. A perfect shuffle is required for the FFT and is achieved by reading into each BP 2 points that are spaced N/2 apart, as seen in Fig. 3.

2. The data that come out of the BPs are stored in memory until they are needed for calculation in the next stage. As with the data input, each 16 points of output is stored in one clock cycle. Unlike the input, however, the output is cached as consecutive data points.

3. The procedure continues with data being read from one memory unit and stored in another until all the data from one stage have been processed. The next stage is then calculated in the same fashion, using the numbers that were just stored from the previous stage.

FIG. 3. Data retrieval and storage to and from the PRAM.

The fact that eight BPs operate in parallel, along with the need for a perfect shuffle of the data, requires a special data storage scheme. This memory configuration, called the Photonic RAM (PRAM) and shown in Fig. 4, is divided into four banks of RAM. Each bank is further subdivided into four units representing the physical blocks of storage, as indicated by the dashed lines in Fig. 4. When reading data into the BPs, banks 0 and 2 are read first, at the same address; banks 1 and 3 are then accessed, also at a common address. This ensures that all the BPs are supplied with data simultaneously and that each BP gets data that are N/2 points apart. The operation continues, reading from banks 0 and 2 followed by banks 1 and 3 at consecutive addresses, until all the data for one stage have been read. When data are stored, 16 consecutive points have to be written simultaneously on each pass; see Fig. 3. The outputs are therefore stored at the same address in banks 0 and 1, as shown in Fig. 4. For example, on pass 1, data points 0 through 15 result from the BP computations and are stored at address 0 of banks 0 and 1. On the next pass, the outputs are stored at the next address of banks 0 and 1. This repeats until banks 0 and 1 are full; the routine then proceeds to write to the same successive addresses of banks 2 and 3. When banks 2 and 3 are full, another stage has been completed. This memory retrieval and storage method has the added benefit that data from the previous stage can be read directly into the current stage with no transformations. In other words, the data shuffle between stages is achieved without any physical reorganization of the data; a sketch of the resulting access pattern is given below.

FIG. 4. Memory bank configuration (PRAM).
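As a concrete illustration, the read/write schedule for one stage can be sketched in C. The index arithmetic below is our reconstruction from the description above and Figs. 3 and 4; the paper gives no explicit address equations, so the helper names and the exact interleaving are assumptions.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the optical transfers to/from the PRAM. */
static void read_banks(int b0, int b1, int addr)  { printf("read  banks %d,%d @ addr %d\n", b0, b1, addr); }
static void write_banks(int b0, int b1, int addr) { printf("write banks %d,%d @ addr %d\n", b0, b1, addr); }

/* One stage of an N-point FFT: 16 points are read and 16 written per
 * clock cycle, so a stage takes N/16 passes. */
static void pram_stage(int n_points)
{
    int passes = n_points / 16;
    for (int pass = 0; pass < passes; pass++) {
        /* Reads alternate between banks {0,2} and {1,3} at a common
         * address, giving each BP two points spaced N/2 apart. */
        if (pass % 2 == 0) read_banks(0, 2, pass / 2);
        else               read_banks(1, 3, pass / 2);
        /* Writes fill banks {0,1} at successive addresses, then {2,3}. */
        int base = (pass < passes / 2) ? 0 : 2;
        write_banks(base, base + 1, pass % (passes / 2));
    }
}

int main(void) { pram_stage(64); return 0; }   /* tiny example stage */
```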

2.3. The FFT System

The two basic elements of the FFT design are the PFFT and the PRAM, which form a cascadable chip set. The system designer can choose anywhere from the fastest and largest design, a pipelined system with log₂ N PFFTs and PRAMs, down to the smallest configuration, a single PFFT and two PRAMs. The smallest system operates as follows (see Fig. 5; a control-loop sketch is given after the list):

1. PRAM 0 is filled with the initial data set.

2. The first stage of the FFT is calculated, 16 points at a time, through the PFFT, and the data are stored in PRAM 1.

3. The second stage of the FFT is performed, this time with data being read from PRAM 1 and the output stored in PRAM 0.

4. Steps 2 and 3 are repeated in a ping-pong fashion until enough passes have been completed to finish the FFT.

FIG. 5. Single PFFT system.
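This schedule is compact enough to capture in a few lines of C; compute_stage() below is a hypothetical stand-in for streaming one full stage through the PFFT, not an interface defined by the chipset.

```c
#include <stdio.h>

/* Hypothetical stand-in: stream one full stage through the PFFT,
 * reading from PRAM src and writing to PRAM dst. */
static void compute_stage(int stage, int src, int dst)
{ printf("stage %2d: PRAM %d -> PFFT -> PRAM %d\n", stage, src, dst); }

/* Ping-pong schedule of steps 1-4 for the single-PFFT system (Fig. 5). */
static void fft_single_pfft(int log2_n)
{
    int src = 0, dst = 1;                /* step 1: input data in PRAM 0 */
    for (int stage = 0; stage < log2_n; stage++) {
        compute_stage(stage, src, dst);  /* steps 2-3 */
        src ^= 1; dst ^= 1;              /* step 4: swap PRAM roles */
    }
    /* result ends up in the PRAM written by the final stage */
}

int main(void) { fft_single_pfft(13); return 0; }   /* 8,192-point FFT */
```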

The multiple-PFFT system is a variable-sized pipelined structure, as shown in Fig. 6. It requires m PFFTs and m + 1 PRAMs for maximum performance, where m = log₂ N. The initial data set is stored in PRAM 0, and data march through the pipeline from one processor to the next, each time completing a stage of calculations. The final data are stored in PRAM m or can be read directly from PFFT m.

FIG. 6. Multiple PFFT system.

3. CHIPSET DESIGN

3.1. Optical Transmitter and Receiver Circuits

A number of hybrid SEED receiver circuits have been designed and tested recently [10, 11]. Experimental measurements on single-ended and two-beam transimpedance receivers have confirmed switching energies less than 60 fJ and operation at data rates in excess of 800 Mb/s. The area of a simple transimpedance receiver circuit that uses one gain stage and two parallel MOS transistors as the feedback element is approximately 300 µm² in a 0.8 µm 5 V CMOS technology [10].

The optical transmitter circuit is a CMOS inverter operating at 5 V. The output of this inverter sees a pair of reverse-biased quantum well diode modulators as its capacitive load. With the present hybrid SEED process this capacitance is approximately 100 fF [12], an order of magnitude smaller than the capacitance of the flip-chip or perimeter pads used for electronic packaging. Lower capacitance translates directly into low on-chip power consumption. The dynamic power consumption per transmitter for the proposed chipset can be estimated as

$$P_{\text{diss}} = \tfrac{1}{2} C V^2 f = \tfrac{1}{2}(100\ \text{fF})(5\ \text{V})^2(150\ \text{MHz}) \approx 200\ \mu\text{W}. \tag{3.1}$$

In addition, there will be an extra 200–300 µW of power dissipation per transmitter due to the photocurrent flowing through the output driver when the modulator is in the absorbing state. This leads to about 350 mW of power consumption for the 768 transmitters used in our chipset.

In summary, the chipset calls for 768 dual-beam optical receivers and 768 dual-beam optical transmitters per chip. These are arranged in a 48 × 64 array of quantum well diodes, corresponding to a 48 × 32 array of transmitters and receivers. The floorplan for the proposed chipset puts the quantum well diode array in the middle of the chip. The diode array uses a 96 × 64 array of pads, which requires 3.8 × 2.5 mm of silicon area assuming a 40 µm pad pitch. Receiver amplifier and transmitter driver circuits are placed directly under the quantum well diode array. The power consumption of the entire receiver/transmitter section is approximately 3 W per chip using existing transimpedance receivers; this can be reduced to 1–2 W with receivers optimized for low power.
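As a quick check of Eq. (3.1) and the chip-level total, the arithmetic can be reproduced in a few lines; the 250 µW photocurrent term below is simply the assumed midpoint of the 200–300 µW range quoted above.

```c
#include <stdio.h>

int main(void)
{
    const double C = 100e-15;                 /* modulator load, F   */
    const double V = 5.0;                     /* supply voltage, V   */
    const double f = 150e6;                   /* clock frequency, Hz */
    double p_dyn   = 0.5 * C * V * V * f;     /* Eq. (3.1): ~188 uW  */
    double p_photo = 250e-6;                  /* assumed midpoint of the
                                                 200-300 uW photocurrent term */
    double p_total = 768 * (p_dyn + p_photo); /* all transmitters on a chip */
    printf("dynamic power per transmitter: %.0f uW\n", p_dyn * 1e6);
    printf("total for 768 transmitters  : %.0f mW\n", p_total * 1e3); /* ~340 mW */
    return 0;
}
```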

3.2. Optics Configuration

For typical smart pixel applications such as photonic switching, the cascaded optical connections between chips are strictly feedforward. Feedforward multichip optical systems have previously been demonstrated by combining polarization with space-division multiplexing [3] and polarization with pupil-division multiplexing [4]. Both of these optical systems are applicable to the multichip pipelined PFFT processor (system 2), and readers are referred to those papers for a detailed description. Note that systems 1 and 2 do not need an optical data shuffle, since the shuffle is achieved through time-division multiplexing. Because the PFFT processor uses only simple one-to-one connections, a holographic optical interconnect is not required; this simplifies the system and reduces loss.

3.3. Photonic RAM Chip

The photonic RAM (PRAM) chip integrates sixteen 256 × 96 SRAMs on a single chip. This yields a total of over 2.5 million transistors in an estimated 1 × 1 cm die area. The 768 optical inputs and 768 optical outputs are arranged as a 48 × 32 transmitter–receiver array in the center of the PRAM chip. The receiver amplifier and transmitter driver circuits are placed directly under the quantum well diode array. This floorplan allows for efficient wire routing between the SRAMs and the transmitter–receiver array. The PRAM chip supports simultaneous READ and WRITE access to enable single-FFT and multi-FFT processor systems. In the single-FFT mode, power consumption is reduced, since only 8 of the 16 SRAMs are active at any one time. We estimate the on-chip power consumption at

$$P_{\text{diss}} = P_{\text{optical I/O}} + 8\,P(256 \times 96\ \text{SRAM}) = 3.1\ \text{W} + 8(328\ \text{mW}) = 5.72\ \text{W}. \tag{3.2}$$

In the multi-FFT mode, 12 of the 16 SRAMs are accessed at a time. The on-chip power consumption would then be

$$P_{\text{diss}} = P_{\text{optical I/O}} + 12\,P(256 \times 96\ \text{SRAM}) = 3.1\ \text{W} + 12(328\ \text{mW}) = 7.04\ \text{W}. \tag{3.3}$$

3.4. Photonic FFT Chip

The architecture of the radix-2 butterfly processor was described in Section 2. The implementation uses a fixed-point two's-complement numeric format. It requires 70,136 transistors in a 1.9 × 2.8 mm area. Power consumption is estimated at 474 mW operating at 150 MHz with a 50% switching factor. The design is fully pipelined and can compute a new radix-2 butterfly every clock cycle, with a pipeline latency of 3 clock cycles. The processor has 96 inputs for two 48-bit complex numbers with 24-bit real and 24-bit imaginary parts. Similarly, 96 outputs are used for two 48-bit complex numbers. There are 32 input ports to bring in one 32-bit fixed-point complex twiddle coefficient with 16-bit real and 16-bit imaginary parts.

The photonic FFT (PFFT) chip integrates eight radix-2 butterfly processors and two 512 × 128 twiddle-storage SRAMs on a single chip. The PFFT chip integrates over 1.4 million transistors in an estimated 1.2 × 1.2 cm die area. The 768 optical inputs and 768 optical outputs are arranged as a 48 × 32 transmitter–receiver array in the center of the PFFT chip. The receiver amplifier and transmitter driver circuits are placed directly under the quantum well diode array. This floorplan allows for efficient wire routing between the eight processor modules and the diode array. The PFFT chip operates in pipelined mode, computing eight radix-2 butterflies every clock cycle; thus on every clock cycle, 768 optical inputs are received and 768 optical outputs are transmitted. The twiddle coefficients required to compute the FFT are stored directly on the PFFT chip in two banks of 512 × 128 SRAMs. This memory capacity allows FFT sizes up to 8,192 points to be computed. Twiddle coefficients are loaded electrically prior to computing the FFT. The power consumption for the PFFT chip is estimated at

$$P_{\text{diss}} = P_{\text{optical I/O}} + 8\,P(\text{PE}) + 2\,P(\text{SRAM}) = 3.1\ \text{W} + 8(474\ \text{mW}) + 2(560\ \text{mW}) \approx 8\ \text{W}. \tag{3.4}$$
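The three chip-level budgets, Eqs. (3.2)–(3.4), follow the same additive pattern; a short check of the arithmetic:

```c
#include <stdio.h>

int main(void)
{
    const double p_io   = 3.1;    /* optical I/O section, W */
    const double p_sram = 0.328;  /* one 256 x 96 SRAM, W   */
    const double p_pe   = 0.474;  /* one butterfly PE, W    */
    const double p_tw   = 0.560;  /* one twiddle SRAM, W    */
    printf("PRAM, single-FFT mode: %.2f W\n", p_io +  8 * p_sram);        /* Eq. (3.2) */
    printf("PRAM, multi-FFT mode : %.2f W\n", p_io + 12 * p_sram);        /* Eq. (3.3) */
    printf("PFFT                 : %.2f W\n", p_io + 8 * p_pe + 2 * p_tw);/* Eq. (3.4) */
    return 0;
}
```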

4. PERFORMANCE EVALUATION

Compared with electronic implementations, the photonic design described in this paper provides higher chip pin-count and off-chip I/O bandwidth. Specifically, the PFFT and PRAM chips have the following:

$$\text{PINOUT} = 8\ (\text{PEs}) \times [48 \times 2\ (\text{inputs}) + 48 \times 2\ (\text{outputs})] = 1{,}536$$
$$\text{BANDWIDTH} = 1{,}536 \times 150\ \text{MHz} = 230{,}400\ \text{Mbit/s} = 29\ \text{Gbyte/s},$$

which is an order of magnitude increase in off-chip bandwidth over conventional silicon ICs [13]. On-chip power density is less than 10 W/cm², and it is dominated by the switching of the internal electronic circuits.

The single PFFT system described in Section 2.3 will now be referred to as system 1. Consider the 8,192-point complex FFT, the maximum our design can handle. It requires 53,248 (N/2 × log₂ N) radix-2 butterflies. Using our eight-processor FFT chip, we need a total of 53,248/8 passes, each taking 6.7 nsec (1/150 MHz). This adds up to a total computation time of 44 µs. For a 1,024-point complex FFT, 4.3 µs is required.

At the chip level, the PFFT chip clearly outperforms electronic designs. At the board level, the performance advantage is not as clear. Since electronic systems have difficulty achieving 29 Gbyte/s interchip communication, they partition the FFT computation to achieve maximum performance while keeping off-chip I/O within the bounds of electronic packaging [14]. Typically, a radix-2 or higher butterfly is implemented on a single chip and data are fed to it from an SRAM. Then, to achieve higher performance, processors are arranged in pipelined fashion to execute the log₂ N stages of the FFT computation [15].


To build a higher-performance smart pixel FFT processor, we can operate multiple photonic FFT and RAM chips in a pipelined fashion, as described in Section 2.3. In this approach, the latency remains the same, while the throughput increases dramatically. This system, called system 2, can achieve an order-of-magnitude performance increase with a corresponding increase in hardware cost. For the 8,192-point complex FFT, there would be 13 PFFTs and 14 PRAMs, for a total of 27 chips. Unlike system 1, where it is necessary to wait for the data to "ping-pong" back and forth before a new FFT can be calculated, system 2 is a straight-through pipeline, in which a new FFT is completed every time a full pass is made on the last stage. The resulting throughput is given by [(# of butterflies per stage)/8] × 6.7 nsec = (4,096/8) × 6.7 nsec = 3.4 µs; that is, a new FFT is computed every 3.4 µs. For a 1,024-point complex FFT, 0.44 µs is required. One important consideration for pipelined systems is that their external bandwidth requirements increase as well. These systems are therefore useful in applications such as focal plane arrays, where sensor data can be captured and passed to the processor array without creating a bottleneck. The timing arithmetic for both systems is summarized in the sketch below.
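The latency and throughput figures for both systems reduce to counting passes. The check below uses the exact 1/150 MHz period, so the printed values differ slightly from the figures above, which round the pass time to 6.7 nsec.

```c
#include <stdio.h>

int main(void)
{
    const double t_pass = 1.0 / 150e6;               /* one pass ~ 6.7 ns */
    const int sizes[] = { 8192, 1024 };
    for (int i = 0; i < 2; i++) {
        int n = sizes[i], log2n = 0;
        while ((1 << log2n) < n) log2n++;
        long total_bf = (long)(n / 2) * log2n;       /* N/2 * log2(N) butterflies */
        double t_sys1 = (total_bf / 8.0) * t_pass;   /* one PFFT, 8 BPs per pass  */
        double t_sys2 = ((n / 2) / 8.0) * t_pass;    /* pipeline: last stage only */
        printf("%5d-point: system 1 in %.1f us, system 2 one FFT per %.2f us\n",
               n, t_sys1 * 1e6, t_sys2 * 1e6);
    }
    return 0;
}
```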

5. DISCUSSION AND CONCLUSIONS

The design presented in this paper offers a highly flexible, high-performance, low chip-count solution for the numerous applications that require high-speed FFT calculation. Our design uses fixed-point complex arithmetic. To increase precision, a simple modification, called block floating point, can be added to our chipset. A true floating-point implementation, on the other hand, would significantly increase the complexity and transistor count of the PFFT chip.

ACKNOWLEDGMENTS

The contributions of Craig Lund of Mercury Computer, David Schwartz of Hughes, and Michael Fleming of Butterfly DSP are acknowledged, specifically in regard to suggesting useful modifications to the photonic FFT chipset design. The work at UNC Charlotte was funded by a research grant from AFOSR (F49620-95-1-0257).

REFERENCES

1. Forrest, S. R., and Hinton, H. S. (Eds.), Special issue on smart pixels. IEEE J. Quantum Electron. 29, 2 (Mar. 1993).
2. Hinton, H. S. An Introduction to Photonic Switching Fabrics. Plenum, New York, 1993.
3. McCormick, F., et al. Five-stage free-space optical switching network with field-effect transistor self electro-optic effect device smart-pixel arrays. Appl. Opt. 33, 8 (Mar. 1994), 1601–1618.
4. McCormick, F., et al. Six-stage digital free-space optical switching network using symmetric self electro-optic effect devices. Appl. Opt. 32, 26 (Sept. 1993), 5153–5171.
5. Cloonan, T. J. Comparative study of optical and electronic interconnection technologies for large asynchronous transfer mode packet switching applications. Opt. Engrg. 33, 5 (May 1994), 1512–1523.
6. Krishnamoorthy, A. V., Marchand, P. J., Kiamilev, F. E., and Esener, S. C. Grain-size considerations for optoelectronic multistage interconnection networks. Appl. Opt. 31, 26 (Sept. 1992), 5480–5507.
7. Goossen, K. W., et al. GaAs MQW modulators integrated with silicon CMOS. IEEE Photon. Technol. Lett. 7, 4 (Apr. 1995), 360–362.
8. CMOS14TB Design Reference Manual. Hewlett-Packard ICBD Division, 1995.
9. McCormick, F. B. Free space interconnection techniques. In Photonics in Switching (J. E. Midwinter, Ed.), Vol. II, pp. 169–250. Academic Press, New York, 1993.
10. Krishnamoorthy, A. V., et al. Ring oscillators with optical and electrical readout based on hybrid GaAs MQW modulators bonded to 0.8 µm silicon VLSI circuits. Electron. Lett. 31, 22 (Oct. 26, 1995), 1917–1918.
11. Lentine, A. L., et al. 700 Mb/s operation of optoelectronic switching nodes comprised of flip-chip-bonded GaAs/AlGaAs MQW modulators and detectors on silicon CMOS circuitry. Proc. Conference on Lasers and Electro-Optics (CLEO), Baltimore, May 1995, Paper CPD-11.
12. Krishnamoorthy, A. V., et al. Arrays of optoelectronic switching nodes comprised of flip-chip bonded modulators and detectors on CMOS silicon circuitry. IEEE Photon. Technol. Lett. 8, 2 (Feb. 1996), 221–223.
13. Tummala, R. R., and Rymaszewski, E. J. Microelectronics Packaging Handbook. Van Nostrand Reinhold, New York, 1989.
14. LH9124 Digital Signal Processor Databook. SHARP Electronics Corporation, 1992.
15. Fleming, M. Record setting, high performance, modular DSP chip set. DSP Multimedia Technol. (Dec. 1994), 19–33.

RICHARD G. ROZIER is a research assistant at the University of North Carolina at Charlotte. He received his B.S.E.E. and M.S.E.E. from UNCC in 1992 and 1996, respectively. He is working on his Ph.D. in electrical engineering, studying the automated design and testing of application-specific integrated circuits through the use of hardware description languages. He is a member of IEEE.

FOUAD E. KIAMILEV is an assistant professor of electrical engineering at UNC Charlotte. His areas of expertise are submicrometer CMOS VLSI chip design, smart pixels, signal processing hardware, and high-speed network protocols. Dr. Kiamilev is currently working with Lucent on an AFOSR-funded program to demonstrate smart pixel chipsets for image computing applications. He received his Ph.D. and B.S. in electrical engineering and computer science from the University of California at San Diego in 1992 and 1988, respectively. Dr. Kiamilev has published over 14 journal papers and 40 conference publications in the area of smart pixels. He is a member of OSA and IEEE.

ASHOK V. KRISHNAMOORTHY received the B.S. (Hons.) from the California Institute of Technology in 1986, the M.S.E.E. from the University of Southern California in 1988, and the Ph.D. in electrical engineering from the University of California at San Diego in January 1993. After spending a year as a research scientist in the Electrical and Computer Engineering Department at UCSD, he joined the Advanced Photonics Research Department at Bell Laboratories as a member of technical staff in 1994. His research interests are in the area of optical interconnects, optoelectronic–VLSI smart pixels, and their application to parallel computing, memory access, and switching. He has published over 20 journal papers, 60 conference papers, and 1 book chapter, presented over 10 invited conference talks, and filed more than 10 patents. He has served on several conference program committees for the IEEE Lasers and Electro-Optics Society (LEOS), IEEE Computer Society (CS), and the Optical Society of America (OSA), and is the program chair for the 1997 IEEE LEOS Workshop on High Speed Interconnections within Digital Systems. Dr. Krishnamoorthy is a member of Tau Beta Pi, Sigma Xi, OSA, and SPIE.

Received December 15, 1995; revised February 15, 1996; accepted October 21, 1996