THE JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS Volume 15, Issue 2, June 2008
ZHENG Zhao-xia, ZOU Lian-ying, GAO Jun
Power optimization and performance improvement for embedded Ethernet SOC
CLC number TN4; TP393
Document A
Article ID 1005-8885 (2008) 02-0102-05
Abstract An information appliance combines a traditional home appliance with Internet technology. In this article, an Ethernet controller system-on-chip (SOC) solution for information appliances is presented. To achieve high performance, the embedded 8-bit 8051 micro control unit (MCU) is optimized with an independent instruction bus and data bus, and a two-stage pipeline is added. Compared with the existing 8051 core, the enhanced one-cycle MCU offers a tenfold improvement in instruction execution efficiency. Meanwhile, the performance of the media access control (MAC) circuit is greatly improved by techniques such as direct memory access (DMA) control and a paging strategy. To reduce the power consumption, clock gating, a low supply voltage, and multiple working clocks are adopted. Moreover, to achieve rapid data communication between circuits running at different clock frequencies, a simple ping-pong first in first out (FIFO) circuit is realized. The chip is implemented using TSMC 0.25 μm two-poly four-metal mixed signal complementary metal oxide semiconductor (CMOS) technology. Its die area is 4.8 mm × 4.6 mm. The test results show that the maximum throughput of Ethernet packets can reach 7 Mb/s while the power consumption is rather low: the working current is just about 200 mA.
Keywords low power technology, hardware/software co-design, circular buffer, throughput
Received date: 2007-06-12
ZHENG Zhao-xia, GAO Jun
Research Center for VLSI and Systems, Huazhong University of Science and Technology, Wuhan 430074, China
E-mail: [email protected]
ZOU Lian-ying
School of Electric and Information Engineering, Wuhan Institute of Technology, Wuhan 430074, China
1 Introduction
Information appliances are the combination of traditional home appliances and modern Internet technology. With the market growing rapidly, information appliances are becoming increasingly popular. In practice, the existing Ethernet solutions for information appliances are mainly composed of three chips: an embedded processor device (for instance, 8051), a network transceiver (for instance, DM9008), and an electrically erasable programmable read-only memory (EEPROM) (for instance, 93C46). Such a multi-chip solution, however, incurs high power consumption, a large printed circuit board (PCB) area, and high material cost. The required throughput of Ethernet packets for information appliance applications is lower than that of personal computer (PC) applications; at the same time, these applications are considerably cost sensitive, and low power consumption is also a key issue [1]. With the rapid development of modern silicon technology and very large scale integration (VLSI) technology, it is possible to integrate the different integrated circuits (ICs) mentioned above into one single chip, an SOC solution. In this article, we propose a low-cost, low-power, and high-performance Ethernet controller SOC solution for information appliance applications. Various functions are implemented on a single chip by combining a high-speed network transceiver and a general-purpose processor. The required performance and cost efficiency are achieved through its optimized architecture and novel features. This article is organized as follows. In Sect. 2, the embedded Ethernet controller system architecture is depicted. In Sect. 3 and Sect. 4, the high-performance circuits and the low-power techniques applied to them are described in detail. In Sect. 5, the layout design and test results are presented. Finally, the conclusions are drawn in Sect. 6.
2 System architecture
In this article, we propose a hardware/software co-design [2] that is optimized at both the architecture and circuit levels [3]. The software mainly deals with the transmission control protocol/internet protocol (TCP/IP). Figure 1 illustrates the overall hardware architecture. We integrate a general-purpose microprocessor, a high-performance MAC, a two-channel DMA controller, two ping-pong FIFO buffers, a media independent interface (MII) unit, and an embedded static random access memory (SRAM). There are also some peripheral blocks, such as an analog to digital converter (ADC) and three 16-bit timer/counters. Supporting functions for system configuration are provided, including on-chip flash, a serial communication interface, an inter-integrated circuit (I2C) interface, an interrupt controller (INTC), and the relevant input/output (I/O) ports.
Fig. 1 Hardware architecture of Ethernet controller SOC
The 32 kB read only memory (ROM) acts as a boot loader for the software stored in the 128 kB on-chip flash memory. In this embedded system, the software is referred to as firmware, which can be updated or upgraded through the Ethernet port, the universal asynchronous receiver transmitter (UART) port, or the I2C interface. This SOC adopts a multi-clock system to reduce the power consumption: the MII unit and peripheral circuits work at 25 MHz, and the other major circuits operate at the MCU working frequency of 50 MHz. Two ping-pong FIFO buffers are inserted for fast communication between the circuits in these two clock domains. The 32 kB built-in SRAM is used as data memory, except for 256 B reserved for variables and the stack. To improve the throughput of the MAC, the MAC can access the SRAM data buffer directly through two independent DMA channels: the transmit DMA channel, which reads data from the SRAM and sends it via the TxFIFO to the network, and the receive DMA channel, which receives data from the network via the RxFIFO and writes it to the SRAM buffer. Because the on-chip SRAM is shared by all of these blocks, priority arbitration is applied to this 32 kB shared data area. The priority levels are set as follows: first, the MCU, which needs fast read/write speed; second, the receive process of the network data, which is more urgent than the transmit process; third, the transmit process; and last, the verification calculation of the TCP/IP protocol. Following this priority sequence, the arbitration control logic settles all requests to access the shared SRAM data area.
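The priority sequence described above can be sketched as a fixed-priority arbiter. This is only a behavioral illustration, not the chip's arbitration logic: the names `Requester` and `arbitrate` are hypothetical, and the model assumes that exactly one requester is granted per arbitration round.

```python
from enum import IntEnum


class Requester(IntEnum):
    """SRAM requesters; a lower value means a higher priority (order from the text)."""
    MCU = 0           # MCU read/write access
    RX_DMA = 1        # receive DMA channel
    TX_DMA = 2        # transmit DMA channel
    TCPIP_VERIFY = 3  # verification calculation of the TCP/IP protocol


def arbitrate(pending):
    """Return the highest-priority requester in `pending`, or None when idle."""
    return min(pending) if pending else None


# Assumption: one grant per round; the winner is served and the rest keep waiting.
pending = {Requester.TCPIP_VERIFY, Requester.TX_DMA, Requester.RX_DMA}
grant_order = []
while pending:
    winner = arbitrate(pending)
    grant_order.append(winner.name)
    pending.remove(winner)
print(grant_order)  # ['RX_DMA', 'TX_DMA', 'TCPIP_VERIFY']
```

A fixed-priority scheme like this can starve the lowest level under sustained load; whether the chip adds any fairness mechanism is not stated in the text.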
3 Major high-performance circuits
3.1 Enhanced 1-cycle MCU
The 8051 is one of the most popular 8-bit microprocessor architectures. However, it takes 12 clock cycles to complete a machine cycle, and 1 to 4 machine cycles to complete an instruction. Thus, for traditional 8051 systems, running an instruction needs 12 to 48 clock cycles [4]. The overall capability is limited to about 3 million instructions per second (MIPS) at most, even if the crystal oscillator frequency is as high as 40 MHz. In practice, such a schedule contains several idle cycles, which limit its capability for high-speed Ethernet devices. Higher performance calls for a new architecture, a reduced instruction set computer (RISC) pipelined microprocessor that remains compatible with the standard 8051 instruction set. In this article, we propose an intellectual property (IP) core for a renewable RISC microprocessor, which adopts a Harvard-style dual-bus structure (the instruction bus and the data bus are separated). In addition, wasted bus states are eliminated, and by means of a fixed-length instruction set, we obtain lower hardware cost and faster computation speed. To improve the performance of the MCU, we divide the instruction execution process into two steps: fetch and execution. We use a two-stage pipeline to overlap the execution of successive instructions. Most single-byte instructions are performed in a single clock cycle. For complicated instructions that cannot be completed in one cycle, the execution phase is deliberately prolonged, and its last clock cycle is used simultaneously to fetch the next instruction. This multi-cycle instruction pipeline is illustrated in Fig. 2.
Fig. 2 Multi-cycle instruction pipeline
The enhanced MCU, which is compatible with the standard 8051 instructions, shows improved performance and lower power consumption compared with the original versions. It can operate at an average rate of about one instruction per cycle, offering a great improvement in instruction execution speed. Consequently, higher throughput is possible at the same crystal oscillator frequency.
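The timing figures in this subsection can be checked with a short calculation. The sketch below is only a simplified model: `legacy_8051_mips` and `pipelined_cycles` are hypothetical helper names, and the pipeline model assumes that every fetch except the first is hidden behind the last execution cycle of the preceding instruction. The execution-cycle list in the last line is an illustrative input, not measured data.

```python
def legacy_8051_mips(f_clk_hz, machine_cycles):
    """Instruction rate of a classic 8051 with 12 clocks per machine cycle."""
    return f_clk_hz / (12 * machine_cycles) / 1e6


def pipelined_cycles(exec_cycles):
    """Clock cycles for a two-stage fetch/execute pipeline.

    Assumption: the next fetch overlaps the last execution cycle of the
    current instruction, so only the very first fetch costs an extra cycle.
    """
    return 1 + sum(exec_cycles)


print(round(legacy_8051_mips(40e6, 1), 2))  # 3.33 (best case, one machine cycle)
print(round(legacy_8051_mips(40e6, 4), 2))  # 0.83 (worst case, four machine cycles)
print(pipelined_cycles([1, 1, 2, 1]))       # 6 (hypothetical instruction mix)
```

The first result matches the stated limit of about 3 MIPS at 40 MHz for the traditional core.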
3.2 Circular buffer for DMA channel
The minimum length of an Ethernet frame is 64 B, and the maximum length is 1518 B. To use the receive DMA channel buffer efficiently, a paging strategy is deployed to manage the data buffer. Each page is 256 B long; thus, the shortest packet takes only one page, and the longest possible packet takes 6 pages. The DMA control module uses two static registers to mark the address range of the receive buffer: the start-page address register and the end-page address register. When the SOC system is reset, the start-page and end-page address registers are set according to the physical receive buffer address. Physically, the receive buffer is a continuous linear address space; logically, it can be regarded as a circular multi-page structure, as shown in Fig. 3.
Fig. 3 Structure of the circular buffer
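The page counts quoted for the shortest and longest frames follow from a ceiling division by the 256 B page size. A minimal sketch, where `pages_needed` is a hypothetical helper and per-page overhead is ignored:

```python
PAGE_SIZE = 256  # bytes per page, as stated in the text


def pages_needed(frame_len):
    """Number of 256 B pages occupied by a frame of `frame_len` bytes."""
    return -(-frame_len // PAGE_SIZE)  # ceiling division without floats


print(pages_needed(64))    # 1
print(pages_needed(1518))  # 6
```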
In Fig. 3, two dynamic registers are used to mark the read/write status of the receive buffer: the current read page address register and the current write page address register. These two registers record the logical page address instead of the physical address. When the SOC system is initialized, both are set to the start-page address. The received network data are stored starting at a 2 B offset in each page (these 2 B are used to indicate the data frame length, the data status, and the next page address). A maximum-length Ethernet frame takes 6 pages. The pages should be allocated contiguously to reduce the complexity of the control logic circuits. When the current read or write page address reaches the last available page, the DMA control logic re-initializes the corresponding register to the start-page address. When the current write page address register advances until it equals the current read page address, the buffer is full; subsequent network data are rejected until an empty page becomes available. When the current read page address register advances until it equals the current write page address, the whole buffer has been emptied, and no valid data is available.
By means of such a circular page buffer structure, the complexity of circuit control logic is notably reduced, and the DMA channel buffer utilization is improved remarkably.
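The circular page structure can be modelled in a few lines. The sketch below is a behavioral illustration under stated assumptions, not the DMA control logic itself: the class and method names are hypothetical, the end page is treated as exclusive, and a `used` counter is added to tell a full buffer from an empty one, since the read and write page addresses are equal in both cases.

```python
class CircularPageBuffer:
    """Page-granular ring between a start page and an (exclusive) end page."""

    def __init__(self, start_page, end_page):
        self.start, self.end = start_page, end_page
        self.read_page = start_page   # current read page address register
        self.write_page = start_page  # current write page address register
        self.used = 0                 # assumption: counter separates full from empty

    def _advance(self, page, n):
        """Move a page pointer forward by n pages, wrapping to the start page."""
        size = self.end - self.start
        return self.start + (page - self.start + n) % size

    def free_pages(self):
        return (self.end - self.start) - self.used

    def write_frame(self, n_pages):
        """Store a frame; return False (frame rejected) if it does not fit."""
        if n_pages > self.free_pages():
            return False
        self.write_page = self._advance(self.write_page, n_pages)
        self.used += n_pages
        return True

    def read_frame(self, n_pages):
        """Release the pages of the oldest frame once it has been read."""
        if n_pages > self.used:
            raise ValueError("no valid data to read")
        self.read_page = self._advance(self.read_page, n_pages)
        self.used -= n_pages


# Hypothetical page addresses: an 8-page receive buffer.
buf = CircularPageBuffer(start_page=0x40, end_page=0x48)
assert buf.write_frame(6)      # longest frame occupies 6 pages
assert not buf.write_frame(6)  # only 2 pages left, so the frame is rejected
buf.read_frame(6)              # reader consumes the frame
assert buf.write_frame(6)      # write pointer wraps past the end page
print(hex(buf.write_page))     # 0x44
```

Keeping frames in contiguous pages means each pointer update is a single modular addition, which reflects the reduced control complexity the text attributes to this structure.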
4 Low-power design
Power is a secondary consideration following speed. The power dissipation in this CMOS Ethernet controller SOC comes from two components, the static dissipation and the dynamic dissipation:
$$P_{total} = P_{static} + P_{dynamic} \qquad (1)$$
Dynamic dissipation is far greater than static dissipation when systems are active, and hence, static power is often ignored. The primary dynamic dissipation component is the charging of the load capacitance. Suppose a load capacitance $C$ is switched between ground and $V_{DD}$ at an average switching frequency of $f_{sw}$. For any given interval of time $T$, the load will be charged and discharged $T f_{sw}$ times. Current flows from $V_{DD}$ to the load to charge it, and then flows from the load to ground during discharge. In one complete charge/discharge cycle, a total charge of $Q = C V_{DD}$ is thus transferred from $V_{DD}$ to ground. The average dynamic power dissipation is:
$$P_{dynamic} = \frac{1}{T}\int_{0}^{T} i_{DD}(t)\, V_{DD}\, dt = \frac{V_{DD}}{T}\int_{0}^{T} i_{DD}(t)\, dt \qquad (2)$$
Taking the integral of the current over the interval $T$ as the total charge delivered during that time, we simplify this to
$$P_{dynamic} = \frac{V_{DD}}{T}\left(T f_{sw} C V_{DD}\right) = C V_{DD}^{2} f_{sw} \qquad (3)$$
Since most gates do not switch every clock cycle, the switching frequency $f_{sw}$ can be expressed as an activity factor $\alpha$ times the clock frequency $f$. The dynamic power dissipation may then be rewritten as:
$$P_{dynamic} = \alpha C V_{DD}^{2} f \qquad (4)$$
Thus, the dynamic power dissipation can be reduced by decreasing the activity factor $\alpha$, the switching capacitance $C$, the power supply $V_{DD}$, or the operating frequency $f$. Clock gating is used in this Ethernet controller SOC design to reduce the activity factor; circuits that are not needed can be turned off while the required circuits are executing. Device-switching capacitance is reduced by choosing small transistors, and minimum-sized gates are used on non-critical paths. Interconnect switching capacitance is most effectively reduced through careful floor planning, placing communicating units near each other to reduce wire lengths.
The voltage has a quadratic effect on dynamic power, so we choose a lower power supply to reduce the power consumption. The frequency can also be traded off against power consumption. We lower the clock frequency of the non-critical circuits in this SOC: a multi-clock system is implemented to reduce the power consumption greatly; the MII unit and other peripheral blocks work at 25 MHz, while the MCU and some major circuits work at 50 MHz.
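The effect of the two knobs discussed here, supply voltage and clock frequency, can be illustrated numerically with equation (4). The activity factor, capacitance, and supply voltages below are hypothetical values chosen only for illustration; they are not taken from the chip.

```python
def dynamic_power(alpha, c_farads, vdd_volts, f_hz):
    """Equation (4): P_dynamic = alpha * C * VDD^2 * f, in watts."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz


# Hypothetical block: alpha = 0.1, C = 100 pF (illustrative values only).
p_50 = dynamic_power(0.1, 100e-12, 2.5, 50e6)  # block clocked at 50 MHz
p_25 = dynamic_power(0.1, 100e-12, 2.5, 25e6)  # same block clocked at 25 MHz
p_lv = dynamic_power(0.1, 100e-12, 1.8, 50e6)  # same block at a lower, hypothetical supply

print(p_25 / p_50)            # halving f halves the dynamic power
print(round(p_lv / p_50, 3))  # scales with (1.8 / 2.5) ** 2
```

Under this model, running a block at 25 MHz instead of 50 MHz halves its dynamic power, while a supply reduction pays off quadratically.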
5 Layout and test results
The chip is successfully implemented using TSMC 0.25 μm two-poly four-metal mixed signal CMOS technology. Its die area is 4.8 mm × 4.6 mm. Figure 4 shows the whole chip layout. SRAM data buffers are located near the corresponding logic blocks. To reduce the power consumption of the internal bus operations, the asynchronous MII unit and the ping-pong FIFO are placed as close as possible to the relevant pins. The one-cycle 8-bit MCU and other peripherals are integrated together.
Fig. 4 Whole chip layout
After hardware and software implementation, the test chip is tested [5] in the environment [6] shown in Fig. 5. We set the Ethernet controller SOC to work in MII mode, and use an external MII interface module to generate the Ethernet packets for this SOC. We use the Keil software to compile the firmware and create the hex file for the boot ROM [7]. Through the UART port, we can download new firmware to the internal flash and observe detailed information about the embedded network controller chip. The capability of the enhanced MCU is validated through an 8-bit multiplication, as shown in Table 1. The multiplication of two 8-bit registers, Rx = Rx × Ry, consists of four instructions: MOV A, Rx; MOV B, Ry; MUL AB; and MOV Rx, A. From Table 1, the standard 8051 consumes 96 clock cycles to complete this computation, while the enhanced MCU takes only 6 clock cycles. Compared with the standard 8051, the enhanced MCU provides greatly improved performance and power consumption. It can operate at a rate of one instruction per cycle, offering a great improvement in instruction execution efficiency.
Fig. 5 Test environment of the test chip
Table 1 Statistical result of the 8-bit register data multiplication in standard 8051 and modified MCU
Instruction          Machine code   Instruction length/B   Standard 8051/clock cycles   Our MCU/clock cycles
MOV A, Rx            E8h-EFh        1                      12                           1
MOV B, Ry            88h-8Fh        2                      24                           2
MUL AB               A4h            1                      48                           2
MOV Rx, A            F8h-FFh        1                      12                           1
Total clock cycles                                         96                           6
Throughput is one of the most important parameters for an Internet device [8]. According to the IEEE 802.3 protocol, throughput is defined as $TP = (8 \times 2 \times Len)/(10^{6} \times RTT)$, where TP is the throughput (Mb/s), Len is the frame length (B), and RTT is the round trip time (s). By connecting the PCB test board to a SmartBit tester [8], we can test the throughput for three types of data: Ethernet packets, user datagram protocol (UDP) packets, and TCP packets. Figure 6 depicts the performance. The maximum throughput of Ethernet packets reaches 6.8 Mb/s, that of UDP packets reaches 5.8 Mb/s, and that of TCP packets reaches 3.4 Mb/s, while the maximum throughput of multi-chip solutions is lower than 3 Mb/s. This is because the multi-chip system needs to access external storage, whose access delay is far greater than that of built-in storage; moreover, the timing between the embedded micro-controller and the external network controller cannot be optimally matched. The test results show that the embedded Ethernet controller single-chip solution performs considerably better than the multi-chip solutions. The overall system power consumption is very low, and the working current is about 200 mA, as illustrated in Fig. 7.
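The throughput definition given in this section is easy to evaluate directly. In the sketch below, the function names are hypothetical, and pairing the reported 6.8 Mb/s figure with a 1518 B frame is an assumption made only for illustration, since the text does not state the frame length at which the maximum was measured.

```python
def throughput_mbps(frame_len_bytes, rtt_seconds):
    """TP = (8 * 2 * Len) / (10^6 * RTT), in Mb/s."""
    return (8 * 2 * frame_len_bytes) / (1e6 * rtt_seconds)


def rtt_for_throughput(frame_len_bytes, tp_mbps):
    """Invert the definition: round trip time that yields a given throughput."""
    return (8 * 2 * frame_len_bytes) / (1e6 * tp_mbps)


# Assumption: longest Ethernet frame (1518 B) at the reported 6.8 Mb/s.
rtt = rtt_for_throughput(1518, 6.8)
print(round(rtt * 1e3, 2))                   # 3.57 (round trip time in ms)
print(round(throughput_mbps(1518, rtt), 1))  # 6.8
```

The factor of 2 accounts for the frame crossing the link once in each direction during one round trip.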
Fig. 6 Throughput performance diagram
Fig. 7 Overall system power consumption
6 Conclusions
In this article, an embedded network controller conforming to the IEEE 802.3 standard is presented. A one-cycle 8-bit MCU, a MAC with two DMA channels, an embedded SRAM, a flash, and peripheral blocks are integrated together. To fulfill the requirements for information appliances, its architecture is optimized especially for power consumption and performance. In addition, various high-performance circuit techniques such as DMA and a circular buffer are proposed to improve the throughput of the major building blocks. As a result, the SOC working current is about 200 mA while the maximum throughput of Ethernet packets can reach 6.8 Mb/s. The chip is successfully implemented using the TSMC 0.25 μm two-poly four-metal mixed signal CMOS process, resulting in a die size of 4.8 mm × 4.6 mm including I/O cells.
Acknowledgements
This work is supported by the Hi-Tech Research and Development Program of China (2006AA01Z226).
References
1. Zhu Yue, Rong Meng-tian. High-performance, low-cost joint equalizer and trellis decoder for 1000Base-T Gigabit Ethernet transceiver. The Journal of China Universities of Posts and Telecommunications, 2007, 14(2): 106-111
2. Tabbara B, Tabbara A. Function/architecture optimization and co-design of embedded systems. Boston, MA, USA: Kluwer Academic Publishers, 2000: 226-244
3. Thomas F, Nayak M M, Udupa S, et al. A hardware/software co-design for improved data acquisition in a processor based embedded system. Microprocessors and Microsystems, 2000, 24(3): 129-134
4. Clark L T, Hoffman E J, Miller J, et al. An embedded 32-b microprocessor core for low-power and high-performance applications. IEEE Journal of Solid-State Circuits, 2001, 36(11): 1599-1608
5. Yang Woo seung, Chung Moo Kyeong, Kyung Chong Min. Current status and challenges of SOC verification for embedded systems market. Proceedings of 2003 IEEE International SOC Conference, Sept 17-20, 2003, Portland, OR, USA. Piscataway, NJ, USA: IEEE, 2003: 213-216
6. Sehgal A, Iyengar V, Chakrabarty K. SOC test planning using virtual test access architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2004, 12(12): 1263-1276
7. Madsen J, Mahadevan S, Virk K, et al. Network-on-chip modeling for system-level multiprocessor simulation. Proceedings of the 24th IEEE International Real-Time Systems Symposium (RTSS'03), Dec 3-5, 2003, Cancun, Mexico. Piscataway, NJ, USA: IEEE, 2003: 265-274
8. Jeffrey I, Gilmore C, Siemens G, et al. Hardware invariant protocol disruptive interference for 100BaseTX Ethernet communications. IEEE Transactions on Electromagnetic Compatibility, 2004, 46(3): 412-422
Biography: ZHENG Zhao-xia, lecturer and Ph.D. candidate in the Department of Electronic Science and Technology, Huazhong University of Science and Technology. Her research interests are MAC hardware circuits implementation.