Network Processors: An Introduction to Design Issues

Patrick Crowley, University of Washington
Mark A. Franklin, Washington University in St. Louis
Haldun Hadimioglu, Polytechnic University, Brooklyn
Peter Z. Onufryk, Integrated Device Technology, Inc.

The objective of this book is to survey current issues in the design of network processors: high-performance, programmable devices designed to efficiently execute
communications workloads. A number of factors have contributed to the development of network processors (NPs) and the NP industry. Over the last 25 years, VLSI circuit performance and the number of transistors on a die have dramatically increased following Moore's law. This has enabled the cost-effective use of relatively high-performance embedded processors for communications functions. There has also been an equivalent increase in telecommunications bandwidth, which, in turn, has been driven by a rapidly growing demand for more functionality and intelligence within the communications network. The term network processor is used here in the most generic sense and is meant to encompass everything from task-specific processors, such as classification and encryption engines, to more general-purpose packet or communications processors.

Figure 1.1 shows an NP in a typical router line card application. In this application, the NP must examine packets at line speed (e.g., OC3 to OC768) and perform a set of operations ranging from basic packet forwarding to complex queuing and quality-of-service (QoS) processing. The real-time processing demands imposed on NPs lead to the use of advanced and novel computer architectures along with the latest VLSI and packaging technologies. Not only must NPs achieve high performance, they also must have the flexibility to deal with
[Figure 1.1: Network processor in a router application. The NP sits on a line card between the line interface (conditioning, framing) and the switch, supported by memories, CAMs, and special-function units; the line card connects through the switch to other line cards and is managed by a host control processor.]
the large and ever-changing set of communications protocols and the increasing demands for new and more complex network services. For some functions, flexibility can be achieved by providing various levels of programmability. For other functions, the real-time demands on NPs dictate the use of dedicated hardware. Thus, the NP designer is forced to balance three key elements against one another:

♦ Real-time processing constraints
♦ Flexibility
♦ Achieving the preceding two in a cost-effective and competitive manner, given physical constraints related to VLSI technologies, packaging, and power

This book considers various aspects of this difficult task. The first part presents a set of research papers devoted to questions related to analyzing, simulating, and designing network processors. The second part begins with an industry analyst's perspective of the NP field, and then continues with a set of papers contributed by companies that describe commercial network processors.

The remainder of this chapter introduces some of the design challenges and corresponding architectural approaches currently used in NP research and development. To this end, problems and solutions are merely sketched; greater detail can be found elsewhere in the book.
1.1 DESIGN CHALLENGES

There are many challenges associated with designing network processors. Many of these are common to general-purpose processor and VLSI design, including external memory bandwidth, power dissipation, pin limitations, packaging, and verification. However, the specific requirements of NPs exacerbate these problems. Although general-purpose processors typically have been designed to improve common-case performance with little regard for certain design elements such as power efficiency, NPs, due to the dual concerns of real-time, link-rate processing and port density, must emphasize worst-case performance in an area- and power-efficient manner. Additionally, NP design involves a host of other systems challenges, including high levels of device integration (on-chip interfaces and controllers for external memories, switch fabrics, co-processors, network interfaces, etc.); management of critical shared resources in a chip-multiprocessor environment (e.g., shared program state, memory interfaces); compiler and software design for high-performance, real-time, parallel, and heterogeneous systems; and real-time system verification. As an example, we now consider two NP design challenges: line speed and application complexity.

As line rates have increased, the time associated with processing a minimum-sized packet has decreased. Consider, for example, a line rate of 10 Gbps along with the simplifying assumption of no interpacket gap. Under these conditions, a stream of minimum-sized packets of 64 bytes will result in the arrival of a packet approximately every 51 nanoseconds. While a stream of minimum-sized packets does not generally represent average traffic, it is under some circumstances the worst-case condition. Buffering and queuing will not help in this situation since, in order to process each packet and meet QoS guarantees, the processing rate must be at least slightly higher than the incoming packet rate.
Otherwise, packets will be lost indiscriminately and queues will build up indefinitely.

Given a single-issue embedded RISC processor that executes a single instruction per clock cycle at a clock frequency of 500 MHz, and assuming no hazards or memory delays, each instruction executes in 2 nanoseconds. The result is that, for the minimum-sized packet and line rate described, only about 25 instructions can be executed in one packet time. Since it is difficult to accomplish much in 25 instructions with a "standard" RISC instruction set, high-performance NPs (i.e., those oriented toward the core of the network) resort to various design techniques to address this challenge. This is considered in the next section. As one moves away from the core of the network to the edge, where flows are aggregated, line rates dramatically decrease. In these edge applications, it is possible to process packets purely in software on a standard RISC processor; however, cost and power constraints often drive designers to more innovative solutions.
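The arithmetic above can be sketched directly. This is a back-of-the-envelope model under the chapter's simplifying assumptions (no interpacket gap, no hazards or memory delays); line rate, clock frequency, and packet size are parameters:

```python
# Instruction budget per minimum-sized packet, under the chapter's
# simplifying assumptions (no interpacket gap, no stalls).

def packet_time_ns(line_rate_gbps, packet_bytes=64):
    """Time between minimum-sized packet arrivals, in nanoseconds."""
    bits = packet_bytes * 8
    return bits / line_rate_gbps  # Gbps is bits per nanosecond

def instruction_budget(line_rate_gbps, clock_mhz, packet_bytes=64):
    """Single-issue instructions available per packet time."""
    cycle_ns = 1000.0 / clock_mhz
    return packet_time_ns(line_rate_gbps, packet_bytes) / cycle_ns

# 10 Gbps, 64-byte packets: ~51.2 ns per packet.
print(packet_time_ns(10))            # 51.2
# 500 MHz single-issue CPU, 2 ns per instruction: ~25 instructions.
print(instruction_budget(10, 500))   # 25.6
```

The same model shows how quickly the budget collapses: at OC-768 rates (roughly 40 Gbps), the budget shrinks to about 6 instructions per minimum-sized packet.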
One might expect that the increase in processor performance enabled by Moore's law would yield enough computing power to keep up with the increase in line speeds. However, that is not the case: over the past 10 years, line speeds and overall bandwidth have increased even faster than processing power. The increase in line speeds is due to the incorporation of fiber-optic links and associated high-speed electronics, whereas the increase in overall bandwidth is due to advances in fiber-optic technologies such as WDM (wavelength division multiplexing).

Added to these challenges is the increasing complexity of network applications that customers are requiring. A simple view of complexity partitions applications into the following three domains:

♦ Applications that operate on individual packet headers (e.g., routing and forwarding).
♦ Applications that operate principally on individual packet payloads (e.g., transcoding).
♦ Applications that operate across multiple packets within a single flow (e.g., certain encryption algorithms) or across multiple flows (e.g., QoS and traffic shaping). A "flow" is considered to be a single source-destination session.

Early networking applications and associated functions focused primarily on the first item, that is, on dealing with packet headers, with the principal application being that of determining the forwarding address associated with a given packet. More recently, attention has shifted to applications in the second two categories. One aspect of the increasing complexity associated with these applications is the current need to perform packet classification on incoming packets. This requires matching selected fields in a packet with stored bit patterns and then appropriately processing the packet. Since the number of patterns, positions, and resulting actions can be very large, this can be a time-consuming process. Another example is the problem of encryption/decryption.
Studies indicate that certain encryption/decryption algorithms are roughly two orders of magnitude more complex than typical header processing applications. When operating at high line rates, real-time solutions to these more complex applications represent challenging design problems.
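To make the classification step concrete, here is a deliberately naive sketch that matches selected header fields against stored rules, first match wins. The field names and rules are hypothetical; commercial NPs use TCAMs or algorithmic search structures precisely because linear search like this does not scale to large rule sets:

```python
# Naive first-match packet classifier. Fields and rules are illustrative
# placeholders, not any particular product's rule format.

RULES = [
    # (field constraints, action); None means "don't care"
    ({"proto": 6, "dst_port": 80}, "forward_to_web_queue"),
    ({"proto": 17, "dst_port": 53}, "forward_to_dns_queue"),
    ({"proto": None, "dst_port": None}, "default_forward"),
]

def classify(packet):
    """Return the action of the first rule whose fields all match."""
    for fields, action in RULES:
        if all(v is None or packet.get(k) == v for k, v in fields.items()):
            return action
    return "drop"

print(classify({"proto": 6, "dst_port": 80}))   # forward_to_web_queue
print(classify({"proto": 6, "dst_port": 22}))   # default_forward
```

Each lookup here touches every rule until one matches; with tens of thousands of rules and a ~51 ns packet budget, the need for hardware search support is evident.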
1.2 DESIGN TECHNIQUES

A variety of architecture techniques have been employed to address the issues discussed in the previous section. These techniques can be broken down into three categories:
♦ Application-specific logic
  • Extending the RISC instruction set
  • Use of customized on-chip or off-chip hardware assists
♦ Advanced processor architectures
  • Multithreading
  • Instruction-level parallelism
♦ Macroparallelism
  • Multiple processors
  • Pipelined processors

For certain applications, selected time-consuming subtasks can be identified and implemented as new instructions in a standard RISC processor instruction set architecture (ISA). Examples of this include bit matching operations, pointer addressing calculations, tree and other data structure searching operations, and CRC polynomial calculations. Naturally, care must be taken that, in implementing these customized instructions, processor clock cycle times are not extended and instruction pipeline stalls are avoided. This approach is widely used since it preserves downward compatibility with existing code, operating systems, and tools. Furthermore, using modern development techniques, the process of adding instructions to an existing instruction set architecture and modifying the development tool chain (e.g., assemblers, compilers) is often a relatively straightforward task. A key area of research, however, is identifying the optimal set of customized instructions to implement for a given application. Additionally, though compilers can be extended to recognize new instructions that have been explicitly included in a program, compiler methods for automatically generating these new instructions are still an area of research.

However, the use of customized instructions does not work well for all applications and functions. For more complex functions and applications, larger blocks of hardware/logic may be needed to meet real-time constraints. In these cases, depending on die area, speed, and cost constraints, the functions may be implemented as accelerators or hardware assists located either on-chip or off-chip.
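To see why CRC calculation is a classic custom-instruction candidate, consider it as pure software: every byte costs an inner loop of shifts, tests, and XORs. The sketch below is the standard bitwise CRC-32 algorithm, not any particular NP's ISA extension; an NP might collapse the entire inner loop into one instruction or a hardware-assisted table lookup:

```python
import zlib

# CRC-32 computed bit by bit, as software must without hardware help.
# The inner 8-iteration loop per byte is exactly the kind of subtask
# NP designers fold into a single custom instruction.

def crc32_bitwise(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):  # one iteration per bit
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320  # reflected CRC-32 polynomial
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Matches the standard CRC-32 check value for "123456789".
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
```

Hardware assists extend this same offloading idea from single operations to entire functions implemented as dedicated on-chip or off-chip blocks.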
Examples of this include hardware accelerators for encryption/decryption and classification. In both of these cases, companies have developed and marketed specialized stand-alone chips implementing these applications, while others have implemented these functions directly on their NP chip.

Another approach to meeting real-time application and bandwidth demands is to move toward more advanced processor designs. One common problem with NP applications is that they require access to large tables and data structures that are held in off-chip memory. With a nonmultithreaded processor, when an
application program makes a reference to data that is not already in its cache, it will stall for a number of cycles waiting for the data to be fetched from off-chip memory. On a multithreaded processor, once such a stall is detected, the processor switches automatically to another application process or thread. This thread can then start executing, thereby utilizing otherwise wasted stall cycles. Thus, processors are used more efficiently and, as a consequence, more packets can be processed per second with a given set of resources. Numerous NP designs employ multithreading techniques. Although this does not alleviate the problem of meeting the given input line speed, it does enhance processor efficiency and throughput.

Another approach to improving processor performance is to exploit instruction-level parallelism (ILP) in programs. This is a very broad topic encompassing a host of design variants. The main idea is to use either the compiler (static techniques) or a hardware instruction scheduler (dynamic techniques) to determine if a group of program instructions can be executed simultaneously on the given set of processor resources. If so, then the instructions are executed in parallel, and the application potentially runs faster. Examining alternative ILP implementation strategies is beyond the scope of this introduction. ILP techniques have not yet enjoyed widespread use in commercial network processors since macroparallelism is generally considered a more efficient method of achieving speedup in packet processing applications.

Employing macroparallelism is very common in NP designs. One approach is based on the fact that, at the flow level, different flows can be considered to be independent. Therefore, if a packet scheduler can, in a balanced manner, route flows to different processors that act independently and in parallel, the application processing associated with these flows can be executed in parallel.
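A minimal sketch of such flow-level dispatch, with hypothetical field names: hashing the flow identity keeps all packets of a flow on one processor, preserving per-flow order while spreading independent flows across processors.

```python
# Flow-level macroparallelism sketch: hash each packet's flow identity
# (its source-destination session) to pick a processor. Field names are
# illustrative, not a real NP's metadata format.

N_PROCESSORS = 4

def dispatch(packet):
    """Map a packet to a processor by hashing its flow identity."""
    flow_id = (packet["src"], packet["dst"], packet["sport"], packet["dport"])
    return hash(flow_id) % N_PROCESSORS

p1 = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80}
p2 = dict(p1)  # a later packet of the same flow

# Packets of one flow always land on the same processor, so their
# relative order is preserved without any reordering mechanism.
assert dispatch(p1) == dispatch(p2)
```

Real schedulers must also balance load, since hashing alone can leave some processors idle while a heavy flow saturates one of them.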
Thus, with n processors, we can theoretically achieve an n-fold speedup in processing. Parallelism can also be achieved at the individual packet level within a flow if mechanisms are employed to maintain packet ordering.

Further increases in processing power can be obtained if we consider the set of tasks associated with packet processing as a sequence that can be executed in a pipelined manner. Thus, a packet might go through a sequence of pipeline processing stages involving classification, packet processing (e.g., forwarding, encryption), and output processing (e.g., QoS, queuing). By doing this, packet throughput of the processor is improved (at the expense of an increase in packet latency). A research question of interest is just how we should allocate tasks and subtasks to pipeline stages in a manner that optimizes cost performance. This is a difficult question since pipeline stages can be viewed as being programmable processors, customized logic, or some combination of the two. Given the additional options involving application algorithms and implementation, the number of choices is enormous.

[Figure 1.2: The Cisco Toaster2 NP (from Chapter 3). Sixteen processing elements (PE0 through PE15) arranged as four parallel pipelines of four stages each, with column memories serving each column of PEs.]

Figure 1.2 exemplifies macroparallelism. It shows a simplified view of the Cisco Toaster2 NP architecture. The architecture consists of 16 processing elements divided into 4 pipelines, each having 4 stages. Details on the operation of this network processor can be found in Chapters 3 and 12.
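The throughput/latency tradeoff of pipelining can be sketched numerically. The stage times below are illustrative, and the model ignores stage-balancing and inter-stage communication overheads:

```python
# Pipeline arithmetic: with s stages of t ns each, a packet's latency is
# s * t, but once the pipeline is full a packet completes every t ns, so
# throughput is set by the stage time alone.

def pipeline_metrics(stages, stage_time_ns):
    """Return (latency in ns, throughput in packets per second)."""
    latency_ns = stages * stage_time_ns
    throughput_pps = 1e9 / stage_time_ns  # one completion per stage time
    return latency_ns, throughput_pps

# Unpipelined: one 200 ns block -> 200 ns latency, 5M packets/s.
print(pipeline_metrics(1, 200))  # (200, 5000000.0)
# Same work split into 4 balanced 50 ns stages -> same latency,
# 4x the throughput.
print(pipeline_metrics(4, 50))   # (200, 20000000.0)
```

An architecture like the Toaster2 multiplies this again by running four such pipelines side by side, combining pipelining with processor-level parallelism.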
1.3 CHALLENGES AND CONCLUSIONS

Designing cost-effective network processors is one of the most challenging of current computer architecture problems. In this introduction, we have touched on some of the central architecture themes that cut across a number of current NP designs. The general issue is that both the design space and the application space are very large. Additionally, both are changing as the underlying hardware technology
continues to rapidly improve, and as protocol standards and driving applications continue to evolve. Naturally, there are key topics that, though important, we have not addressed here. Not the least of them is the issue of NP software and programmability. A key feature driving the use of NPs is the ability to change functionality in response to the changes cited earlier. Providing this flexibility is a difficult tradeoff since different locations in the network (e.g., the core versus the edge) can have significantly different flexibility and performance requirements. It is therefore unlikely that a single approach will work across the entire network.

Part I of this book presents research directed at issues associated with the NP design process. These include developing NP benchmarks, modeling performance, and using these models to aid in selecting designs and in generating code for them. Part II presents descriptions of commercial network processors that focus on the architecture, performance, and software aspects of these devices. As you read through Part II, you will find illustrations of most of the architecture elements discussed in this introduction. Industry has led the way in NP design, and it is only recently that academic research has begun to focus on the interesting and important problems associated with processor design in this environment.