Dynamically reconfigurable hardware–software architecture for partitioning networking functions on the SoC platform




The Journal of Systems and Software 82 (2009) 1588–1599


The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss

Youngmann Kim a, E.K. Park b, Sungwoo Tak a,*

a School of Computer Science and Engineering, C-26 Office 318, 30 Jangjeon-dong, Geumjeong-gu, Pusan National University, Busan 609-735, South Korea
b School of Computing and Engineering, University of Missouri, Kansas City, MO, USA

Article info

Article history: Available online 20 March 2009

Keywords: System on Chip; Network protocols; Hardware–software co-design; Reconfigurable hardware–software architecture

Abstract

We address the design of a dynamically reconfigurable hardware–software architecture that allows networking functions to be partitioned on a SoC (System on Chip) platform. We treat this as a partitioning problem: implementing network protocol functions as dynamically reconfigurable hardware and software modules. Such a partitioning technique can improve the co-design productivity of hardware and software modules. The proposed technique, called the ITC (Inter-Task Communication) technique incorporating the RT-IJC2 (Real-Time Inter-Job Communication Channel), makes it possible to partition networking functions into hardware and software modules on the SoC platform. It also supports the modularity and reuse of complex network protocol functions, enabling a higher level of abstraction when mapping future network protocol specifications onto the SoC platform. In particular, the RT-IJC2 allows more complex data transfers between hardware and software tasks while providing real-time data processing for given application-specific real-time requirements. We conduct a variety of experiments to illustrate the application and efficiency of the proposed technique after implementing it on a commercial SoC platform based on Altera's Excalibur, which includes the ARM922T core and up to 1 million gates of programmable logic. © 2009 Elsevier Inc. All rights reserved.

* Corresponding author. Tel.: +82 51 510 2387; fax: +82 51 515 2208. E-mail address: [email protected] (S. Tak).
0164-1212/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2009.03.015

1. Introduction

Multimedia applications that use the TCP/IP protocol stack now run on a wide range of devices. The trend in embedded network/multimedia system design is to integrate all of the major system functions and network protocol cores into a single chip, the SoC (System on Chip). Many companies have recognized the need for Internet connectivity in embedded systems and have presented either ASIC (Application-Specific Integrated Circuit) or reusable IP (Intellectual Property) libraries that support all or part of a network protocol stack such as TCP/IP. In general, network protocols implemented in hardware can be faster than those implemented in software, because a hardware implementation avoids the sequential processing of individual functions and lets the system perform multiple computations at the same time through parallel processing. The major reason for building system and networking functions in hardware is therefore to satisfy performance constraints. Such constraints can be expressed as a bound on the overall latency of a given task (more specifically, a deadline for the task) and as the ability to sustain specified input/output data rates over multiple executions of system and networking functions.

In this paper we consider a reconfigurable hardware–software architecture that allows networking functions to be partitioned into hardware and software tasks on the SoC platform. Note that tasks and processes essentially refer to the same entity. We also consider more complex data transfers between hardware and software tasks, together with real-time data processing for given application-specific real-time requirements.

To realize a hardware–software co-design platform that supports many different kinds of critical real-time applications and network protocols, three fundamental problems have to be studied. The first is to model the system functionality and timing constraints of real-time applications and network protocols. Most hardware description languages, such as Verilog HDL (Hardware Description Language) and VHDL (Very-high-speed integrated circuit Hardware Description Language), describe the system functionality only as a set of computations performed by a computing element. Thus, we need an explicit specification of real-time processing both at the level of a task unit and at the level of a data frame unit exchanged between hardware and software modules. The second problem is the specification and synthesis of a real-time inter-task communication interface among software-to-software, software-to-hardware, and hardware-to-hardware tasks. The third is how to design a data frame unit with real-time features and how to provide a real-time inter-task communication channel that concurrently carries inter-cooperative jobs between software and hardware modules. We use the term 'Inter-Cooperative Job (ICJ)' in a sense similar to 'message', 'frame', or 'data unit': a discrete unit of information communicated over the inter-task communication channel, which allows more complex data transfers among tasks.

This paper addresses the feasibility of hardware–software co-design for network protocols on the SoC platform as well as the performance of real-time processing among hardware and software tasks. Additionally, we consider the functionality required for the real-time processing of application tasks and network protocol functions, and explain how real-time processing can be implemented on a SoC platform.

This paper is organized as follows. Section 2 reviews the related work that motivates the dynamically reconfigurable hardware–software architecture for partitioning networking functions into hardware and software tasks on the SoC platform. Section 3 presents the dynamically reconfigurable SoC platform developed in this research. Section 4 analyzes the performance of the ITC (Inter-Task Communication) technique incorporating the RT-IJC2 (Real-Time Inter-Job Communication Channel), which resolves the hardware–software co-design issue of partitioning networking functions into hardware and software tasks on the SoC platform. In Section 5, the proposed technique is evaluated in terms of the average processor utilization, the number of context switches, and the deadline miss ratio of hardware and software tasks, each of which should be as low as possible. This evaluation measures how well application-specific real-time processing and QoS (Quality of Service) functionality can be achieved on the proposed SoC platform. Section 6 concludes this paper.

2. Motivation

In this section, we discuss the existing sub-problems of hardware–software co-design for embedded real-time network/multimedia systems in terms of system partitioning, the RTOS (Real-Time Operating System) with its real-time scheduling and synchronization between hardware and software components, hardware–software communication interfaces, application-specific QoS, and the TOE (Traffic Offload Engine) for high-speed packet processing.

Gupta et al. (1994), Lagnese and Thomas (1991), Olson et al. (2007), Banerjee et al. (2006), and Paulin et al. (2006) have presented hardware–software partitioning techniques. Gupta et al. (1994) and Lagnese and Thomas (1991) consider the partition problem with respect to the feasibility of a hardware–software implementation and the satisfaction of non-real-time timing constraints. Olson et al. (2007) provide an excellent review of partitioning problems in terms of two approaches: partitioning toward a preexisting or fixed hardware platform, and traditional VLSI (Very Large-Scale Integration) partitioning. Banerjee et al. (2006) present a flexible method that focuses on the physical and architectural constraints imposed on a dynamically reconfigurable architecture. The work in Olson et al. (2007) and Banerjee et al. (2006) ignores real-time scheduling and QoS constraints for given application-specific requirements. Paulin et al. (2006) consider a flexible multiprocessor SoC platform that supports high-speed hardware-assisted messaging, context switching, and scheduling. However, it does not consider real-time scheduling that incorporates task and message transmission deadlines; it supports only high-priority scheduling, fair sharing based on a round-robin scheme, and best-effort scheduling.

Mooney and Blough (2002), Gauthier et al. (2001), Lee et al. (2003), Nakano et al. (1999), and Lahiri et al. (2004) have discussed implementing the RTOS kernel in software and in hardware. The general expectation is that a hardware RTOS kernel outperforms a software RTOS kernel. The work in Lee et al. (2003) and Nakano et al. (1999) proposes a hardware RTOS unit called the RTU (Real-Time Unit). The RTU is a hardware operating system that moves scheduling, IPC (Inter-Process Communication) such as semaphores, and time management such as time ticks and delays from the software RTOS kernel to the hardware RTOS kernel. Nakano et al. (1999) report that the total performance of the hardware RTOS kernel, including context switching, is twice that of the software RTOS kernel. However, it is not clear that this result generalizes, that is, that a hardware RTOS kernel always outperforms a software one. On the contrary, Lahiri et al. (2004) present an optimization approach for the software RTOS kernel that in some cases performs even better than the hardware RTOS kernel, although the experimentation conducted in Lahiri et al. (2004) is not exhaustive.

Lahiri et al. (2004) propose that adding a layer of circuitry, called the CAT (Communication Architecture Tuner), around any existing communication architecture can enhance a system's ability to adapt to the changing communication needs of its constituent components. The CAT considers protocol performance parameters such as the number of missed packet deadlines and the average packet processing time. Bolotin et al. (2004) define a QoS and cost model for communications in a SoC platform and classify SoC inter-module communication traffic into four classes of service: signaling for inter-module control signals, real-time for delay-constrained bit streams, short data access modeling, and block transfer for large data bursts. However, Bolotin et al. (2004) do not detail how QoS and real-time guarantees are specified in terms of task and packet scheduling.

The work in Panic et al. (2003), Dollas et al. (2005), Lofgen et al. (2005), Reigner (2004), and Mogul (2003) discusses architectural solutions for on-chip protocol layer implementation and the high-performance issues of existing protocols such as the TCP/IP protocol suite. As architectural solutions, an on-chip implementation of the IEEE 802.11a MAC (Medium Access Control) layer is presented in Panic et al. (2003), and TCP/IP protocol layer implementations are presented in Dollas et al. (2005) and Lofgen et al. (2005). As discussed in Reigner (2004) and Mogul (2003), one central question of TCP/IP high performance has been whether it is more appropriate to implement TCP in host CPU software or in the network interface system. The latter approach is usually called TOE. In particular, Mogul (2003) argues that TOE per se is neither of much overall benefit nor free from significant costs and risks, but that TOE in the service of very specific goals might actually be useful.

In this paper, we reconsider the features reviewed in this section and build a SoC platform for real-time embedded network/multimedia systems that resolves the problem of partitioning network protocol functions into dynamically reconfigurable hardware and software tasks running concurrently on a SoC platform.

3. Dynamically reconfigurable SoC platform

Building a dynamically reconfigurable SoC platform for a real-time embedded network/multimedia system has become possible with four fundamental hardware–software co-design technologies: (1) a dynamically reconfigurable architecture that accommodates the enhancement and revision of future network protocol specifications, (2) the ITC interface incorporating the RT-IJC2, (3) application-specific real-time processing and QoS functionality through fine-grained real-time processing at the level of a data frame unit and coarse-grained real-time processing at the level of a task unit, and (4) a high-performance JOE (Job Offload Engine) that alleviates application-specific real-time and QoS constraints. The architecture of the dynamically reconfigurable SoC platform proposed in this paper is shown in Fig. 1.

The platform was built using the Excalibur EPXA4 device (Altera, 2007) with the LAN91C111 10/100 Ethernet chip (SMSC Corporation, 2008). The Excalibur EPXA4 chipset has an ARM922T processor core that supports an MMU (Memory Management Unit) and 8K data and 8K instruction caches. The RTOS kernel running on the platform is implemented by modifying and improving uC/OS (Labrosse, 2002), which is an embedded OS (Operating System) rather than an actual RTOS. We therefore add three well-known real-time task schedulers to uC/OS (Cottet et al., 2002): T-RM (Task scheduling with Rate Monotonic), T-DM (Task scheduling with Deadline Monotonic), and T-EDF (Task scheduling with Earliest Deadline First). The prefix T denotes a task scheduler.

The fundamental consideration in designing the SoC platform is the principal division of system functions and tasks: the architecture of the RTOS kernel, the communication protocol tasks, and the runtime environment. In the proposed SoC platform, the RTOS kernel provides the smallest possible set of services and resources on which the remaining required services can be built.
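The three task schedulers named above differ mainly in the key used to order ready tasks. The following is a minimal sketch of that difference, not the uC/OS scheduler itself: the `Task` record, the `pick_next` helper, and the sample values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    period: float        # Tp, in an arbitrary but consistent time unit
    rel_deadline: float  # Td, relative to the release time
    abs_deadline: float  # absolute deadline of the task's current job


def pick_next(ready, policy):
    """Return the ready task that the given policy would run next."""
    keys = {
        "T-RM": lambda t: t.period,         # shortest period first
        "T-DM": lambda t: t.rel_deadline,   # shortest relative deadline first
        "T-EDF": lambda t: t.abs_deadline,  # earliest absolute deadline first
    }
    return min(ready, key=keys[policy])


# Hypothetical ready set on which the three policies disagree.
ready = [
    Task("audio", period=25.0, rel_deadline=25.0, abs_deadline=40.0),
    Task("video", period=66.0, rel_deadline=30.0, abs_deadline=35.0),
    Task("ctrl", period=20.0, rel_deadline=28.0, abs_deadline=50.0),
]

for policy in ("T-RM", "T-DM", "T-EDF"):
    print(policy, "->", pick_next(ready, policy).name)
```

With this ready set, each policy selects a different task, which is why the choice of task scheduler matters for the deadline behavior evaluated later in the paper.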
The basic set of services includes a task abstraction, a set of real-time schedulers, and task synchronization using binary and counting semaphores.

To obtain a dynamically reconfigurable architecture that accommodates the enhancement and revision of future network protocol specifications, we first decompose every system function, including the network protocol functions, into corresponding software tasks. This task decomposition supports coarse-grained real-time processing by partitioning all significant system and networking functions into software tasks and meeting the deadlines of those tasks through a real-time scheduler. Subsequently, we select a candidate set of software tasks and rebuild them in Verilog HDL and VHDL as a set of hardware tasks. A hardware task is realized as a unit of concurrency targeted at programmable logic, that is, FPGA (Field Programmable Gate Array) logic. Such hardware tasks mainly support TCP/IP protocol tasks or multimedia encoders/decoders and are incorporated in the FPGA logic of the SoC platform. Because all system functions and network protocol functions are conceptualized as tasks, software and hardware tasks can easily be loaded dynamically on the SoC platform. Lee et al. (2007) describe in more detail how to implement hardware tasks in Verilog HDL and VHDL.

To provide real-time data exchange with bidirectional communication among any software and hardware tasks, we design a new unified inter-communication interface, described as the ITC technique incorporating the RT-IJC2. All software tasks in C and hardware tasks in Verilog HDL and VHDL, attached to different AHBs (ARM Advanced High-performance Buses), have identical programmable function specifications. A data frame unit generated by a software or hardware task is contained in the ICJ frame unit introduced in Section 1, and the ICJ frame is then exchanged through the RT-IJC2. The RT-IJC2 consists of one or more ICJ schedulers and multiple ICJ Queues. We implement the ICJ Queues with the shared memory technique well known in operating systems. An ICJ frame is selected from an ICJ Queue and then scheduled by an ICJ scheduler. The RT-IJC2 thus supports fine-grained, real-time communication among software tasks residing in main memory and hardware tasks residing in the FPGA logic.

As a real-time scheduler running on the RT-IJC2, we propose a new scheduler, ICJ-EDIT (Inter-Cooperative Job – Earliest Deadline Inheritance to the corresponding Task deadline). In addition, the ICJ-FIFO (First-In and First-Out), ICJ-WFQ (Weighted Fair Queue), ICJ-RM, ICJ-DM, and ICJ-EDF schedulers are developed and used in the experiments. Except for ICJ-EDIT, these ICJ schedulers (i.e., ICJ-FIFO, ICJ-WFQ, ICJ-RM, ICJ-DM, and ICJ-EDF) share the characteristics of the corresponding well-known task schedulers available in the literature. The proposed SoC platform needs no additional arbiter operations built into it; a new system or network function can be dynamically loaded as either a software task or a hardware task. The RT-IJC2 allows real-time data transfers between software and hardware tasks. In this way the proposed technique achieves a high degree of openness and dynamic reconfiguration of software and hardware tasks.

Fig. 2 shows the structure of an ICJ frame exchanged over the RT-IJC2. The ICJ frame consists of the ICJ-Header and the ICJ-Data. In the ICJ-Header, the 8-bit Task ID field contains a unique numeric identifier of the task currently accessing and processing the ICJ

[Fig. 1 diagram: RTOS architecture on the SoC platform for embedded network/multimedia systems. Components shown: the EPXA4 device; the RTOS kernel with the real-time task scheduler and the semaphores used for synchronization among S/W and H/W tasks; Bus #1 (AHB1), the bus bridge, and Bus #2 (AHB2); the RT-IJC2 (Real-Time Inter-Job Communication Channel) with the ICJ (Inter-Cooperative Job) scheduler and ICJ Queue; the ARM922T processor core; main memory holding S/W Tasks (A), (B), and (C); the FPGA holding H/W Network Protocol Tasks #1 to #n; the Ethernet chip (LAN91C111); peripherals; and the ITC (Inter-Task Communication).]

Fig. 1. A dynamically reconfigurable SoC platform.

[Fig. 2 layout: ICJ-Header fields: Task ID (8-bit), Ownership (64-bit), Length (16-bit), Reserved (32-bit), followed by the real-time properties Release Time (32-bit), Period (32-bit), Execution Time (32-bit), Deadline (32-bit), and Drop Policy (8-bit). ICJ-Data field: Data (8 × Length bits).]

Fig. 2. Structure of an ICJ frame consisting of header and data.

frame. Note that the first and second bits are reserved, so a 6-bit Task ID is used at this time. In the 64-bit Ownership field, each bit is set to 1 when the corresponding task, among $2^6$ tasks, occupies the ICJ frame. All 0's in this field indicate that the ICJ frame will be freed by the RT-IJC2. The 16-bit Length field gives the length of the ICJ-Data part in bytes.

The ICJ-Header has five fields with real-time properties. The 32-bit Release Time field (denoted by ICJr) gives the time that the ICJ frame waits until its first release. The 32-bit Period field (denoted by ICJp) gives the time between successive requests of the ICJ frame, unless its value is all 1's; all 1's in the Period field indicate that the ICJ frame is processed aperiodically. The 32-bit Execution Time field (denoted by ICJe) and the 32-bit Deadline field (denoted by ICJd) are relative to the release time of the ICJ frame. An ICJ frame that misses its deadline is discarded or kept according to the 8-bit Drop Policy field, where 1 denotes Drop and 0 denotes No Drop.

In our model, a periodic task $\tau_i$ is described by four nonnegative numbers, $\tau_i = (Tr_i, Tp_i, Te_i, Td_i)$: $Tr_i$ is the release time, that is, the time the task waits until its first request; $Tp_i$ is the period (the time between successive requests of $\tau_i$); $Te_i$ is the execution time; and $Td_i$ is the deadline. An aperiodic task $\tau_i$ is described by $\tau_i = (Tr_i, Ta_i, Te_i, Td_i)$, where $Ta_i$ denotes the average release rate of the aperiodic task. An $ICJ_{i,j}$ frame denotes an ICJ frame that task $\tau_i$ forwards to task $\tau_j$. Periodic and aperiodic $ICJ_{i,j}$ frames are each represented by four primary parameters with the same meaning as those of periodic and aperiodic tasks: a periodic $ICJ_{i,j} = (ICJr_{i,j}, ICJp_{i,j}, ICJe_{i,j}, ICJd_{i,j})$ and an aperiodic $ICJ_{i,j} = (ICJr_{i,j}, ICJa_{i,j}, ICJe_{i,j}, ICJd_{i,j})$.

The ICJ-EDIT scheduler mentioned earlier in this section operates as follows. If the total utilization of the task set is no greater than 1, an ICJ frame entering the ICJ Queue at time t can be feasibly scheduled by the following additional procedure: if PU-T(t) + PU-ICJ(t) + PU-ICJA(t) < 1, the ICJ frame that has arrived at the ICJ Queue is schedulable, and the real-time properties of the ICJ frame are inherited by the task that needs to handle the ICJ frame at time t. PU-T(t) denotes the Processor Utilization factor of all Tasks ready to run at time t. PU-ICJ(t) denotes the Processor Utilization factor of all ready ICJ frames whose priority is higher than that of the ICJ frame that has just arrived at time t. PU-ICJA(t) denotes the Processor Utilization factor of the ICJ frame that has just arrived at time t.
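The ICJ-Header layout and the ICJ-EDIT admission condition described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the byte order (big-endian), the absence of padding, the choice of the two most significant Task ID bits as the reserved ones, and the names `ICJHeader`, `APERIODIC`, and `icj_edit_admits` are all assumptions made here. The field order and widths follow Fig. 2.

```python
import struct
from dataclasses import dataclass

# Assumed wire layout: big-endian, unpadded, fields in the order of Fig. 2:
# Task ID (8), Ownership (64), Length (16), Reserved (32), Release Time (32),
# Period (32), Execution Time (32), Deadline (32), Drop Policy (8).
_ICJ_HEADER = struct.Struct(">BQHIIIIIB")
APERIODIC = 0xFFFFFFFF  # all 1's in the Period field: aperiodic frame


@dataclass
class ICJHeader:
    task_id: int    # only the low 6 bits are used (assumed: top 2 reserved)
    ownership: int  # 64-bit mask, one bit per task occupying the frame
    length: int     # length of the ICJ-Data part in bytes
    release: int    # ICJr
    period: int     # ICJp, or APERIODIC
    exec_time: int  # ICJe
    deadline: int   # ICJd, relative to the release time
    drop: bool      # Drop Policy: True = Drop (1), False = No Drop (0)

    def pack(self) -> bytes:
        """Serialize the header; the Reserved field is written as zero."""
        return _ICJ_HEADER.pack(
            self.task_id & 0x3F, self.ownership, self.length, 0,
            self.release, self.period, self.exec_time, self.deadline,
            int(self.drop))

    def is_free(self) -> bool:
        """All 0's in Ownership: the frame may be freed by the RT-IJC2."""
        return self.ownership == 0


def icj_edit_admits(pu_t: float, pu_icj: float, pu_icja: float) -> bool:
    """ICJ-EDIT condition: PU-T(t) + PU-ICJ(t) + PU-ICJA(t) < 1."""
    return pu_t + pu_icj + pu_icja < 1.0


# Hypothetical aperiodic frame owned by task 4, time values in microseconds.
hdr = ICJHeader(task_id=4, ownership=1 << 4, length=64, release=0,
                period=APERIODIC, exec_time=155, deadline=1000, drop=True)
raw = hdr.pack()
print(len(raw), icj_edit_admits(0.25, 0.3, 0.1))
```

Under these assumptions the header occupies 32 bytes. The admission helper only evaluates the inequality; how the three utilization factors are measured at time t is left to the scheduler.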

4. High performance analysis of RT-IJC2

In this section, we evaluate the performance of several ICJ schedulers by analyzing the response time of the ICJ frames they process. The real-time processing of software and hardware tasks running on the SoC platform is handled by one of the real-time task schedulers, i.e., the T-RM, T-DM, or T-EDF scheduler. For real-time applications running on the SoC platform, such as multimedia or videoconferencing, where ICJ frames containing voice and image data are transmitted through the RT-IJC2, the need to deliver ICJ frames in a timely fashion is equally obvious. Since excessive delays in message delivery can significantly degrade the QoS required by real-time applications, we consider two real-time features in message delivery: (1) the real-time properties of the ICJ frame illustrated in Fig. 2 and (2) an ICJ scheduler, ICJ-EDIT, that supports the real-time exchange of ICJ frames between software and hardware tasks through the RT-IJC2.

We now examine how much more efficient the proposed ICJ-EDIT scheduler is than the other ICJ schedulers in terms of the response time of ICJ frames. The ICJ schedulers cooperate with each of two real-time task schedulers (i.e., the T-RM and T-EDF schedulers). The following parameters are used in the analysis of the average response time of an ICJ frame forwarded by the ITC incorporating the RT-IJC2. The response time of the $ICJ_{i,j}$ frame, which task $\tau_i$ forwards to task $\tau_j$, is the time between the submission of the ICJ frame by task $\tau_i$ and the end of its processing by task $\tau_j$. The response time of the $ICJ_{i,j}$ frame is denoted by $ICJresp_{i,j}$.
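The response-time definition above (end of processing minus submission) can be illustrated with a toy queue. This is a simplified sketch, not the analysis of this section: it assumes a single receiving task that processes queued frames one at a time without preemption, that all frames are already queued at time 0, and that each deadline is counted from submission. `Frame`, `response_times`, `missed`, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    submit: float     # submission time by the sending task
    exec_time: float  # processing time at the receiving task (ICJe)
    deadline: float   # relative deadline (ICJd), counted from submission here


def response_times(frames, key):
    """Serve frames one at a time in `key` order; map name -> response time."""
    clock, out = 0.0, {}
    for f in sorted(frames, key=key):
        clock = max(clock, f.submit) + f.exec_time  # end of processing
        out[f.name] = clock - f.submit              # response time
    return out


def missed(frames, resp):
    """Names of frames whose response time exceeds their relative deadline."""
    return sorted(f.name for f in frames if resp[f.name] > f.deadline)


# Hypothetical frames, all queued at time 0.
frames = [
    Frame("a", submit=0.0, exec_time=3.0, deadline=10.0),
    Frame("b", submit=0.0, exec_time=1.0, deadline=3.0),
    Frame("c", submit=0.0, exec_time=3.0, deadline=6.0),
]

fifo = response_times(frames, key=lambda f: f.submit)             # arrival order
edf = response_times(frames, key=lambda f: f.submit + f.deadline)  # deadline order
print("FIFO", fifo, missed(frames, fifo))
print("EDF ", edf, missed(frames, edf))
```

In this toy case the arrival-order queue misses the deadlines of frames b and c, while the deadline-ordered queue misses none, which is the kind of difference the response-time analysis is meant to quantify.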

[Fig. 3 timeline: task i forwards ICJi,j, destined to task j, through the RT-IJC2 interface. After the release time of task i and its execution time Te_i, ICJi,j is submitted at ICJr_i,j. The frames ICJb,j, ICJc,j, ..., ICJi,j are processed by task j in the (n-(k-1))-th, (n-(k-2))-th, ..., n-th rounds. Each round's response time Tresp_j consists of HP{W_j} followed by Te_j, and ICJresp_i,j spans the rounds from the submission of ICJi,j to the end of the n-th round.]

Fig. 3. An example of forwarding and processing ICJ frames.

Fig. 3 shows an example of forwarding and processing ICJ frames through the RT-IJC2 interface. In Fig. 3, we assume that $(k-1)$ ICJ frames have already arrived at the ICJ Queue of task $\tau_j$ before the $n$th ICJ frame, the $ICJ_{i,j}$ frame, enters the ICJ Queue. Thus, as illustrated in Fig. 3, the response time of the $ICJ_{i,j}$ frame, $ICJresp_{i,j}$, is calculated by summing the $k$ values ranging from $Tresp_j^{n-(k-1)}$ to $Tresp_j^{n}$, where $Tresp_j^{n}$ denotes the $n$th response time of task $\tau_j$. Let us assume that the $ICJ_{i,j}$ frame requested by task $\tau_i$ is processed by task $\tau_j$ at the $n$th round. The $n$th response time of task $\tau_j$ that processes the $ICJ_{i,j}$ frame is derived as follows:

$$Tresp_j^{n} = HP\{W_j^{n}\} + Te_j$$

$HP\{W_j^{n}\}$ denotes the time occupied by the set of tasks with higher priority than task $\tau_j$ before task $\tau_j$ processes the $ICJ_{i,j}$ frame:

$$HP\{\rho_j\} = \left( \sum_{(\tau_l \in Tset_P) \wedge (\tau_l \in HP\{\tau_j\})} \frac{Te_l}{Tp_l} \; + \sum_{(\tau_l \in Tset_{AP}) \wedge (\tau_l \in HP\{\tau_j\})} Te_l \times Ta_l \right) \quad \text{for all } l = 1, 2, \ldots, L$$

$$\begin{aligned}
HP\{W_j^{n}\} &= HP\{\rho_j\} \times Tresp_j^{n} = HP\{\rho_j\} \times \left( HP\{W_j^{n}\} + Te_j \right) \\
&= HP\{\rho_j\} \times HP\{W_j^{n}\} + HP\{\rho_j\} \times Te_j = \frac{HP\{\rho_j\} \times Te_j}{1 - HP\{\rho_j\}}
\end{aligned} \tag{1}$$

Eq. (1) estimates $HP\{W_j^{n}\}$, which is derived from $HP\{\rho_j\}$. $HP\{\rho_j\}$ denotes the processor utilization factor of the set of tasks with higher priority than task $\tau_j$ that are scheduled to run before task $\tau_j$ processes the $ICJ_{i,j}$ frame. A task set $Tset$ contains a periodic task set $Tset_P$ and an aperiodic task set $Tset_{AP}$. Let $L$ be the cardinality of $Tset$. In Eq. (1), $HP\{\tau_j\}$ denotes the set of tasks with higher priority than task $\tau_j$. To obtain $HP\{\rho_j\}$, we use three known parameters (i.e., $Te_l$, $Tp_l$, and $Ta_l$) and one unknown variable (i.e., $HP\{\tau_j\}$).

$$\begin{aligned}
ICJresp_{i,j} &= \sum_{l=0}^{k-1} Tresp_j^{n-l} = \sum_{l=0}^{k-1} \left( HP\{W_j^{n-l}\} + Te_j \right) \\
&= \sum_{l=0}^{k-1} \left( \frac{HP\{\rho_j\} \times Te_j}{1 - HP\{\rho_j\}} + Te_j \right) = \sum_{l=0}^{k-1} \left( \frac{Te_j}{1 - HP\{\rho_j\}} \right)
\end{aligned} \tag{2}$$

Eq. (2) estimates $ICJresp_{i,j}$ using the outcome of Eq. (1). In Eq. (2), we assume that $k$ ICJ frames, including the $ICJ_{i,j}$ frame, remain during the $ICJresp_{i,j}$ period. We now show that the proposed ICJ-EDIT scheduler can achieve better performance than the other ICJ schedulers in terms of the response time of ICJ frames. To derive the $ICJresp_{i,j}$ period, we need to determine the unknown variables $k$ and $HP\{\rho_j\}$ in Eq. (2). In turn, we need to determine the unknown variable $HP\{\tau_j\}$ to derive $HP\{\rho_j\}$ in Eq. (1). Eqs. (3), (5), (7), and (8) show how to derive $HP\{\tau_j\}$, and Eqs. (4a)–(4c) show how to derive the unknown variable $k$. Due to limited space, we compare the ICJ-EDIT scheduler only with the ICJ-EDF scheduler, for the cases in which either the T-RM scheduler or the T-EDF scheduler is executed.

$$[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}} = \left\{ \tau_l \;\middle|\; \begin{array}{l} \text{if } (ICJ_{i,j} \in ICJset_P) \text{ then } (\tau_l \in Tset_P) \wedge (\tau_l \prec \tau_j) \\ \text{if } (ICJ_{i,j} \in ICJset_{AP}) \text{ then } (\tau_l \in Tset_P) \vee \left( (Tr_l \le Tr_j) \wedge (\tau_l \in Tset_{AP}) \right) \end{array} \right\} \quad \text{for all } l = 1, 2, \ldots, L \tag{3}$$

In Eq. (3), we derive $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}}$, which stands for the set of tasks with higher priority than task $\tau_j$ when the T-RM scheduler and the ICJ-EDF scheduler run. Eq. (3) consists of two parts. When the $ICJ_{i,j}$ frame is periodic, the first condition applies; otherwise the second condition applies. In the first condition, $\tau_l \prec \tau_j$ indicates that task $\tau_l$ precedes task $\tau_j$, so task $\tau_l$ is included in $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}}$. In the second condition, $Tr_l \le Tr_j$ indicates that the release time $Tr_l$ of aperiodic task $\tau_l$ is earlier than the release time $Tr_j$ of aperiodic task $\tau_j$, so task $\tau_l$ is included in $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}}$. For the T-RM scheduler, we assume that aperiodic tasks are scheduled in the background when no periodic tasks are ready to execute. Thus, any periodic task $\tau_l$ precedes any aperiodic task $\tau_j$. Note that when T-RM is executed, periodic ICJ frames are processed by periodic tasks, and aperiodic ICJ frames are processed by aperiodic or periodic tasks.

$$\left[ k^{ICJ_{i,j}} \right]^{\text{T-RM}}_{\text{ICJ-EDF}} = \left\{ \sum_{(l \in Tset_P) \wedge (l \ne j)} \frac{1}{ICJp_{l,j}} \times \min\!\left( \frac{ICJd_{l,j}}{ICJd_{i,j}}, 1 \right) + \sum_{(l \in Tset_{AP}) \wedge (l \ne j)} ICJa_{l,j} \times \min\!\left( \frac{ICJd_{l,j}}{ICJd_{i,j}}, 1 \right) \right\} \tag{4a}$$

$$\begin{aligned}
&+ \; ICJresp_{i,j} \times \sum_{(l \in Tset_P) \wedge (l \ne j)} \frac{1}{ICJp_{l,j}} \times \max\!\left( \frac{ICJd_{i,j} - ICJd_{l,j}}{ICJd_{i,j}}, 0 \right) \\
&+ \; ICJresp_{i,j} \times \sum_{(l \in Tset_{AP}) \wedge (l \ne j)} ICJa_{l,j} \times \max\!\left( \frac{ICJd_{i,j} - ICJd_{l,j}}{ICJd_{i,j}}, 0 \right)
\end{aligned} \tag{4b}$$

$$\begin{aligned}
= \; &\sum_{(l \in Tset_P) \wedge (l \ne j)} \frac{1}{ICJp_{l,j}} \left( \min\!\left( \frac{ICJd_{l,j}}{ICJd_{i,j}}, 1 \right) + ICJresp_{i,j} \times \max\!\left( \frac{ICJd_{i,j} - ICJd_{l,j}}{ICJd_{i,j}}, 0 \right) \right) \\
+ &\sum_{(l \in Tset_{AP}) \wedge (l \ne j)} ICJa_{l,j} \left( \min\!\left( \frac{ICJd_{l,j}}{ICJd_{i,j}}, 1 \right) + ICJresp_{i,j} \times \max\!\left( \frac{ICJd_{i,j} - ICJd_{l,j}}{ICJd_{i,j}}, 0 \right) \right)
\end{aligned} \tag{4c}$$

Second, Eqs. (4a)–(4c) estimate the average number of ICJ frames that are forwarded to task $\tau_j$ during the $ICJresp_{i,j}$ period. The average number of ICJ frames is denoted by $k^{ICJ_{i,j}}$. We can replace the variable $k$ used in Eq. (2) with the variable $[k^{ICJ_{i,j}}]^{\text{T-RM}}_{\text{ICJ-EDF}}$ given in Eq. (4a). We then estimate the value of $ICJresp_{i,j}$ using $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}}$ derived from Eq. (3) for the given T-RM and ICJ-EDF schedulers. Eqs. (4a) and (4b) count the average number of ICJ frames with an earlier deadline than the $ICJ_{i,j}$ frame: the ICJ frame with the earliest deadline is processed at the highest priority. Eq. (4a) estimates the ICJ frames that have already arrived at the ICJ Queue before the $ICJ_{i,j}$ frame arrives, and can be easily derived from Little's Theorem. $1/ICJp_{l,j}$ and $ICJa_{l,j}$ denote the average arrival rates of the periodic and aperiodic ICJ frames, respectively, that have already arrived at the observed time. $ICJd_{l,j}/ICJd_{i,j}$ denotes the probability that the other $ICJ_{l,j}$ frames have an earlier deadline than the $ICJ_{i,j}$ frame. After the $ICJ_{i,j}$ frame is released, Eq. (4b) counts the additional average number of $ICJ_{l,j}$ frames with higher priority than the $ICJ_{i,j}$ frame that arrive at the ICJ Queue.

$$[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}} = \left\{ \tau_l \;\middle|\; \begin{array}{l} \text{if } (ICJ_{i,j} \in ICJset_P) \text{ then } (\tau_l \in Tset_P) \wedge \left( (\tau_l \prec \tau_j) \wedge (Tp_l \le \min(ICJp_{i,j}, Tp_j)) \right) \\ \text{if } (ICJ_{i,j} \in ICJset_{AP}) \text{ then } (\tau_l \in Tset_P) \vee \left( (Tr_l \le \min(ICJr_{i,j}, Tr_j)) \wedge (\tau_l \in Tset_{AP}) \right) \end{array} \right\} \quad \text{for all } l = 1, 2, \ldots, L \tag{5}$$
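Eqs. (1), (2), and (4c) can be evaluated directly once the higher-priority set and the frame parameters are known. The sketch below is one reading of those formulas, not code from the paper; the function names, argument shapes, and numbers are hypothetical, and all times are assumed to share one unit (with rates in the inverse unit).

```python
def hp_utilization(periodic, aperiodic):
    """HP{rho_j} of Eq. (1): sum of Te/Tp over periodic higher-priority tasks
    plus Te*Ta over aperiodic ones.

    periodic:  iterable of (Te, Tp) pairs
    aperiodic: iterable of (Te, Ta) pairs, Ta being an average release rate
    """
    return (sum(te / tp for te, tp in periodic)
            + sum(te * ta for te, ta in aperiodic))


def task_response(te_j, rho):
    """Tresp_j = HP{W_j} + Te_j = Te_j / (1 - rho), meaningful for rho < 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("higher-priority utilization must lie in [0, 1)")
    return te_j / (1.0 - rho)


def icj_response(te_j, rho, k):
    """ICJresp_{i,j} of Eq. (2): k identical terms Te_j / (1 - rho)."""
    return k * task_response(te_j, rho)


def avg_frames(d_i, resp, periodic, aperiodic):
    """Average frame count of Eq. (4c).

    d_i:       ICJd_{i,j} of the observed frame
    resp:      ICJresp_{i,j}
    periodic:  iterable of (ICJp_{l,j}, ICJd_{l,j}) pairs
    aperiodic: iterable of (ICJa_{l,j}, ICJd_{l,j}) pairs
    """
    def weight(d_l):
        return min(d_l / d_i, 1.0) + resp * max((d_i - d_l) / d_i, 0.0)

    return (sum(weight(d_l) / p_l for p_l, d_l in periodic)
            + sum(a_l * weight(d_l) for a_l, d_l in aperiodic))


# Hypothetical numbers, times in milliseconds.
rho = hp_utilization(periodic=[(1.0, 20.0), (2.5, 25.0)],
                     aperiodic=[(1.0, 0.05)])
resp = icj_response(te_j=4.0, rho=rho, k=3)
k_avg = avg_frames(d_i=20.0, resp=5.0,
                   periodic=[(25.0, 10.0), (50.0, 40.0)], aperiodic=[])
print(rho, resp, k_avg)
```

Because $k$ in Eq. (2) depends on $ICJresp_{i,j}$ through Eq. (4c), the two would have to be solved together in practice; the sketch evaluates each formula separately and leaves that coupling aside.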

Table 1
TCP/IP features implemented by the task-based decomposition.

Feature                          Implemented
ARP Cache                        Yes
ARP Update                       Yes
IP Reassemble                    Yes
IP/TCP Options                   No
TCP Timers                       Yes
TCP Out-of-Sequence Data         No
TCP Multi-Connections            Yes
Variable TCP MSS                 No
TCP RTT Estimation               Yes
TCP Active/Passive Open          Yes
TCP Sliding Window               No
User-level TCP Retransmission    Yes
TCP Delayed ACK                  No
TCP Congestion Control           No
TCP Flow Control                 Yes
TCP Urgent Data                  Yes
IP/UDP/TCP Checksum              Yes
Socket Buffer                    Yes

[Fig. 4 diagram: task set on SoC-Platform-A, connected through the LAN91C111 Ethernet chip, the Ethernet interrupt handler, and the network (NPD, Network Propagation Delay: 130ms) to SoC-Platform-B for performance measurement. Application tasks (APP-TX, APP-RX) sit above the Socket API and the network protocol tasks.

Legend: Periodic S/W Task (Release Time, Period, Execution Time, Deadline); Aperiodic S/W Task (Release Time, Average Release Rate, Execution Time, Deadline); Aperiodic H/W Task (Release Time, Average Release Rate, Execution Time, Deadline); ICJ (Release Time, Period, Execution Time, Deadline, Drop Policy).

Application tasks:
13. EXP-TX Task (t1, 20ms, 1ms, 20ms), ETCD = ; ICJ (t1, 20ms, 155us, D4, No-Drop)
14. PP-TX Task (t1, 0.005/sec, 1ms, 20ms), ETCD = ; ICJ (t1, 0, 155us, D4, Drop)
15. AUDIO-TX Task (t1, 25ms, 2.5ms, 25ms), ETCD = 150ms; ICJ (t1, 25ms, 155us, D4, No-Drop)
16. VIDEO-TX Task (t1, 66ms, 6.6ms, 66ms), ETCD = 250ms; ICJ (t1, 66ms, 155us, D4, Drop)

Network protocol tasks:
1. TCP-RX-SESSION (t10, 0, 133us, )
2. TCP-RX (t10, 0, 153us, )
3. TCP-TX-SESSION (t1, 0, 134us, )
4. TCP-TX (t1, 0, 155us, )
5. IP-RX (t9, 0, 125us, )
6. IP-TX (t2, 0, 127us, )
7. CHK-RX (t8, 0, 754us, )
8. CHK-TX (t3, 0, 850us, )
9. PHY-RX (t7, 0, 1065us, ), with t7 = (t6 + NPD)
10. ARP-MIB (t4, 0, 251us, )
11. ARP-RR (t5, 0, 121us, )
12. PHY-TX (t6, 0, 1002us, )

ICJ frames between the network protocol tasks: ICJ (t10, 0, 133us, D1, Drop Policy); ICJ (t10, 0, 153us, D2, Drop Policy); ICJ (t9, 0, 125us, D5, Drop Policy); ICJ (t2, 0, 127us, D6, Drop Policy); ICJ (t8, 0, 754us, D7, Drop Policy); ICJ (t3, 0, 850us, D8, Drop Policy); ICJ (t7, 0, 1065us, D9, Drop Policy); ICJ (t4, 0, 251us, D10, Drop Policy); ICJ (t5, 0, 121us, D11, Drop Policy); ICJ (t6, 0, 1002us, D12, Drop Policy).]

Fig. 4. A set of task features used in experiments.

As with Eq. (3), Eq. (5) describes how to derive $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}}$, which represents the set of tasks with higher priority than task $\tau_j$ when the T-RM and ICJ-EDIT schedulers are executed. Because Eq. (5) applies the ICJ-EDIT scheduler, when the period $ICJp_{i,j}$ and release time $ICJr_{i,j}$ of the $ICJ_{i,j}$ frame are earlier than those of the task $\tau_j$ that handles the frame, they are inherited by task $\tau_j$.

In the case of a periodic $ICJ_{i,j}$ frame, $Tp_l < \min(ICJp_{i,j}, Tp_j)$ indicates that task $\tau_l$ is included in $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}}$ if the period $Tp_l$ of the periodic task $\tau_l$ is shorter than the minimum of the period $ICJp_{i,j}$ of the $ICJ_{i,j}$ frame and the period $Tp_j$ of the periodic task $\tau_j$. In the case of an aperiodic $ICJ_{i,j}$ frame, $Tr_l < \min(ICJr_{i,j}, Tr_j)$ indicates that the aperiodic task $\tau_l$ is included in $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}}$ if the release time $Tr_l$ of the aperiodic task $\tau_l$ is earlier than the minimum of the release time $ICJr_{i,j}$ of the $ICJ_{i,j}$ frame and the release time $Tr_j$ of the aperiodic task $\tau_j$. Compared with Eq. (3), $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}}$ of Eq. (5) is contained in $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}}$ of Eq. (3). Consequently, $HP\{\tau_j\}$ in Eq. (1) can be replaced with either $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}}$ or $[HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}}$. In summary, we can derive the following inequality (6):

\[
\begin{aligned}
& [HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}} \supseteq [HP\{\tau_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}} \\
\Rightarrow\ & [HP\{\rho_j\}]^{\text{T-RM}}_{\text{ICJ-EDF}} \geq [HP\{\rho_j\}]^{\text{T-RM}}_{\text{ICJ-EDIT}} \\
\Rightarrow\ & [ICJresp_{i,j}]^{\text{T-RM}}_{\text{ICJ-EDF}} \geq [ICJresp_{i,j}]^{\text{T-RM}}_{\text{ICJ-EDIT}}
\quad \text{for all } i = 1, 2, \ldots, L
\end{aligned}
\tag{6}
\]

Eqs. (7) and (8) and inequality (9) are derived in the same manner as Eqs. (3) and (5) and inequality (6), for the case in which the T-EDF scheduler is executed together with the ICJ-EDF or the ICJ-EDIT scheduler:

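\[
[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}} =
\{\, \tau_l \mid (Tr_l + Td_l) < (Tr_j + Td_j) \,\}
\quad \text{for all } l = 1, 2, \ldots, L
\tag{7}
\]

\[
[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}} =
\{\, \tau_l \mid (Tr_l + Td_l) < \min(ICJr_{i,j} + ICJd_{i,j},\; Tr_j + Td_j) \,\}
\quad \text{for all } l = 1, 2, \ldots, L
\tag{8}
\]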
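Eqs. (7) and (8) can be compared side by side in a minimal Python sketch. The `Task` class, the function names, and all numeric values are illustrative assumptions rather than part of the paper; the strict comparison follows the phrase "earlier absolute deadline" used in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    release: float   # Tr
    deadline: float  # Td, relative deadline

    @property
    def absolute_deadline(self):
        return self.release + self.deadline


def hp_edf(tasks, tj):
    """Eq. (7): tasks whose absolute deadline is earlier than that of tj."""
    return {t.name for t in tasks
            if t.absolute_deadline < tj.absolute_deadline}


def hp_edf_edit(tasks, tj, icj_release, icj_deadline):
    """Eq. (8): as Eq. (7), but tj may inherit the ICJ frame's absolute deadline."""
    bound = min(icj_release + icj_deadline, tj.absolute_deadline)
    return {t.name for t in tasks if t.absolute_deadline < bound}


# Hypothetical task set (time unit arbitrary).
a, b, c = Task("A", 0, 5), Task("B", 0, 12), Task("C", 0, 18)
j = Task("J", 0, 20)
tasks = [a, b, c, j]

plain = hp_edf(tasks, j)                 # J keeps its own deadline of 20
inherited = hp_edf_edit(tasks, j, 0, 10)  # J inherits the frame deadline of 10
print(sorted(plain), sorted(inherited))
```

Because the bound in Eq. (8) is a minimum that includes the task's own absolute deadline, the set it produces is always a subset of the set from Eq. (7), which is the first line of inequality (9).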

In Eqs. (7) and (8), $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}}$ and $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}}$ represent the set of tasks with higher priority than task $\tau_j$ when the T-EDF/ICJ-EDF schedulers or the T-EDF/ICJ-EDIT schedulers are executed, respectively. Unlike Eqs. (3) and (5), Eqs. (7) and (8) do not consider whether the ICJ frame is periodic, since the T-EDF scheduler is used as the task scheduler. The EDF algorithm assigns priorities to tasks according to their absolute deadlines: the task with the earliest deadline is executed at the highest priority. In Eq. (7), $(Tr_l + Td_l) < (Tr_j + Td_j)$ indicates that a task $\tau_l$ with an earlier absolute deadline $Tr_l + Td_l$ than task $\tau_j$ is included in $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}}$. In Eq. (8), $(Tr_l + Td_l) < \min(ICJr_{i,j} + ICJd_{i,j}, Tr_j + Td_j)$ indicates that task $\tau_l$ is included in $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}}$ if the absolute deadline $Tr_l + Td_l$ of task $\tau_l$ is earlier than the minimum of the absolute deadline $ICJr_{i,j} + ICJd_{i,j}$ of the $ICJ_{i,j}$ frame and the absolute deadline $Tr_j + Td_j$ of task $\tau_j$. Compared with Eq. (7), the set $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}}$ of Eq. (8) is contained in the set $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}}$ of Eq. (7). Thus, by replacing $HP\{\tau_j\}$ in Eq. (1) with either $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}}$ or $[HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}}$, we can derive the following inequality (9):

\[
\begin{aligned}
& [HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}} \supseteq [HP\{\tau_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}} \\
\Rightarrow\ & [HP\{\rho_j\}]^{\text{T-EDF}}_{\text{ICJ-EDF}} \geq [HP\{\rho_j\}]^{\text{T-EDF}}_{\text{ICJ-EDIT}} \\
\Rightarrow\ & [ICJresp_{i,j}]^{\text{T-EDF}}_{\text{ICJ-EDF}} \geq [ICJresp_{i,j}]^{\text{T-EDF}}_{\text{ICJ-EDIT}}
\quad \text{for all } i = 1, 2, \ldots, L
\end{aligned}
\tag{9}
\]

The complete set of heterogeneous tasks is illustrated in Fig. 4. The system architecture of SoC-Platform-B is the same as that of SoC-Platform-A. The TCP/IP protocol layer is decomposed into software and hardware tasks. The ARP-RR, IP-TX, IP-RX, CHK-TX, CHK-RX, TCP-TX, and TCP-RX tasks are hardware tasks. The PHY-TX, PHY-RX, ARP-MIB, TCP-TX-SESSION, and TCP-RX-SESSION tasks are software tasks. Among the hardware tasks, the ARP-RR task is in charge of transmitting and receiving ARP (Address Resolution Protocol) Request and Reply packets. The IP-TX and IP-RX tasks handle IP packet transmission and reception. The CHK-TX and CHK-RX tasks calculate the checksum of IP packets that are transmitted to and received from the lower layer, respectively. The TCP-TX and TCP-RX tasks handle the transmission and reception of TCP packets, respectively. Among the software tasks, the PHY-TX and PHY-RX tasks handle the transmission and reception of Ethernet frames. These two tasks correspond to the operations of the device driver for the LAN91C111 Ethernet chipset. The ARP-MIB (ARP–Management Information Base) task maintains the ARP cache. The TCP-TX-SESSION and TCP-RX-SESSION tasks handle the initiation and termination of TCP sessions, respectively.

Fig. 5. Average execution time of different PHY and CHK tasks. (Plot of average computation time (µs) versus packet size (bytes) for PHY-TX, PHY-RX, CHK-TX, and CHK-RX as S/W tasks and for CHK-TX and CHK-RX as H/W tasks.)

Table 2
Average execution time of TCP/IP network protocol tasks.

S/W task          Time (µs)   H/W task   Time (µs)
ARP-MIB           251         None
ARP-RR            121         ARP-RR     99
IP-TX             127         IP-TX      88
IP-RX             125         IP-RX      87
TCP-TX-SESSION    134         None
TCP-RX-SESSION    133         None
TCP-TX            155         TCP-TX     94
TCP-RX            153         TCP-RX     92

5. Experiments

All the experiments in this paper were carried out on an SoC platform based on Altera's Excalibur, which includes the ARM922T core and up to 1 million gates of programmable logic. The total execution time of the experiments is 10,000 s. The development tools used in this paper are Quartus 4.0 for FPGA synthesis (Altera, 2008) and the ADS (ARM Developer Suite) 1.2 C compiler (ARM, 2001). The TCP/IP protocol suite is used as the networking protocol stack. Table 1 describes the TCP/IP features implemented in this paper.

Fig. 6. Average application-to-physical layer delay of TCP/IP Protocol. (Plot of average transmission delay (µs) versus packet size (bytes) for uIP RX, uIP TX, TCP/IP-ST RX, TCP/IP-ST TX, TCP/IP-MT RX, and TCP/IP-MT TX.)
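The software and hardware execution times in Table 2 can be compared directly. The following Python sketch computes the ratio of software time to hardware time for the tasks that exist in both forms; the dictionary layout and the `speedup` helper are illustrative assumptions, while the numbers are copied from Table 2.

```python
# Average execution times from Table 2 in microseconds: task -> (S/W, H/W).
TABLE2_US = {
    "ARP-RR": (121, 99),
    "IP-TX": (127, 88),
    "IP-RX": (125, 87),
    "TCP-TX": (155, 94),
    "TCP-RX": (153, 92),
}


def speedup(sw_us, hw_us):
    """Ratio of software to hardware execution time (values above 1 favor hardware)."""
    return sw_us / hw_us


for task, (sw_us, hw_us) in TABLE2_US.items():
    print(f"{task}: {speedup(sw_us, hw_us):.2f}x")
```

Under this measure every listed hardware task is faster than its software counterpart, with the two TCP tasks showing the largest ratio.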

Table 3
Task sets used in the deadline miss ratio experiments (SoC-Platform-A acts as the traffic generator).

ρ = 0.4   TA1 = {AUDIO-TX: (t1, 25 ms, 2.5 ms, 25 ms), VIDEO-TX: (t1, 66 ms, 6.6 ms, 66 ms), PP-TX: (t1, 50 per second, 1 ms, 20 ms), EXP-TX: (t1, 20 ms, 1 ms, 20 ms)}
ρ = 0.5   TA2 = TA1 ∪ {APP-TX2: (t1, 10 ms, 0.5 ms, 10 ms), APP-RX2: (t1, 10 ms, 0.5 ms, 10 ms)}
ρ = 0.6   TA3 = TA2 ∪ {APP-TX3: (t1, 100 per second, 0.5 ms, 10 ms), APP-RX3: (t1, 100 per second, 0.5 ms, 10 ms)}
ρ = 0.7   TA4 = TA3 ∪ {APP-TX4: (t1, 20 ms, 1 ms, 20 ms), APP-RX4: (t1, 20 ms, 1 ms, 20 ms)}
ρ = 0.8   TA5 = TA4 ∪ {APP-TX5: (t1, 10 ms, 0.5 ms, 10 ms), APP-RX5: (t1, 10 ms, 0.5 ms, 10 ms)}
ρ = 0.9   TA6 = TA5 ∪ {APP-TX6: (t1, 20 ms, 1 ms, 20 ms), APP-RX6: (t1, 20 ms, 1 ms, 20 ms)}

Fig. 7. Average deadline miss ratio of network protocol tasks running on the TCP/IP-ST. (Plots of average deadline miss ratio (%) versus processor utilization by application tasks, from 0.4 to 0.9, for the ICJ-FIFO, ICJ-RM, ICJ-DM, ICJ-EDF, ICJ-WFQ, and ICJ-EDIT schedulers.)

Fig. 8. Average deadline miss ratio of network protocol tasks running on the TCP/IP-MT. (Plots of average deadline miss ratio (%) versus processor utilization by application tasks, from 0.4 to 0.9, for the ICJ-FIFO, ICJ-RM, ICJ-DM, ICJ-EDF, ICJ-WFQ, and ICJ-EDIT schedulers.)

Four application tasks are executed by default: a periodic EXP-TX task, an aperiodic PP-TX task, a periodic AUDIO-TX task, and a periodic VIDEO-TX task. The EXP-TX task acts as an exponential traffic generator: it generates and transmits packets whose lengths are exponentially distributed with a mean of 1450 bytes. The PP-TX task generates a packet stream in which packets are transmitted according to a Poisson process at 50 packets per second with a fixed packet length of 1450 bytes. For the experimental environment, we assume that the proposed SoC platform is used in mobile multimedia communication systems. Following ISO/IEC 14496-2:2001 (2001) and Nahrstedt and Steinmetz (1995), the AUDIO-TX and VIDEO-TX tasks are defined as follows. The AUDIO-TX task has a maximum bit rate of 64 kbps, a sampling rate of 44.1 kHz, an execution time of 2.5 ms, and an audio packet size of 208 bytes, which leads to a period of 25 ms. The VIDEO-TX task has a maximum bit rate of 384 kbps, 15 video frames per second, and a frame size of 2000 bytes; each frame is split equally into two packets, so 30 packets are transmitted per second. This leads to an execution time of 6.6 ms with a period of 66 ms. The execution time of all other tasks is the worst-case execution time measured in the experiments. The Drop Policy value used in tasks depends on the drop policy field contained in the ICJ frame. The ETCD (End-to-end Task Communication Deadline) illustrated in Fig. 4 represents the communication delay between distant tasks over networks with an assumed NPD (Network Propagation Delay) of 130 ms. An ICJ frame inherits its period from the task that transmits it. The execution time of an ICJ frame is set equal to that of the task that directly receives it. The deadline of an ICJ frame, $ICJd_{i,j}$, is evaluated by Eq. (10):

\[
\begin{aligned}
ICJd_{i,j} &= (ETCD - NPD) \times \frac{Te_i}{ST_{e,j} + ST_{e,i}}, \\
ST_{e,j} &= \sum_{j \,\in\, \text{tasks processing the downward flow of } ICJ_{i,j}} Te_j, \\
ST_{e,i} &= \sum_{i \,\in\, \text{tasks processing the upward flow of } ICJ_{i,j}} Te_i
\end{aligned}
\tag{10}
\]
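As a worked example of Eq. (10), the following Python sketch reads the equation as a proportional split of the budget ETCD − NPD among the tasks on the end-to-end path, weighted by execution time. This reading, the function name `icj_deadline_us`, and the choice of which tasks lie on the downward and upward paths are assumptions made for illustration; the execution times are the software values shown in Fig. 4 and the ETCD is the audio value of 150 ms.

```python
def icj_deadline_us(etcd_us, npd_us, te_us, downward_te_us, upward_te_us):
    """Deadline share for a frame whose processing task runs for te_us microseconds."""
    budget_us = etcd_us - npd_us
    total_te_us = sum(downward_te_us) + sum(upward_te_us)
    return budget_us * te_us / total_te_us


# Assumed path of an audio packet (execution times in microseconds, from Fig. 4).
DOWNWARD_US = [155, 127, 850, 1002]  # TCP-TX, IP-TX, CHK-TX, PHY-TX
UPWARD_US = [1065, 754, 125, 153]    # PHY-RX, CHK-RX, IP-RX, TCP-RX

ETCD_US = 150_000  # audio ETCD of 150 ms
NPD_US = 130_000   # assumed network propagation delay of 130 ms

tcp_tx_deadline = icj_deadline_us(ETCD_US, NPD_US, 155, DOWNWARD_US, UPWARD_US)
all_deadlines = [
    icj_deadline_us(ETCD_US, NPD_US, te, DOWNWARD_US, UPWARD_US)
    for te in DOWNWARD_US + UPWARD_US
]
print(f"TCP-TX share: {tcp_tx_deadline:.1f} us of {ETCD_US - NPD_US} us")
```

With this reading the per-task deadlines add up to the whole 20 ms budget, so a frame that meets every local deadline also meets the end-to-end deadline.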

We assume that SoC-Platform-A generates and transmits packets to SoC-Platform-B, which receives and processes them. The ETCD values for audio and video services are 150 ms and 250 ms, respectively. $Te_j$ denotes the execution time of each task that receives and processes the $ICJ_{i,j}$ frame in the downward flow of ICJ frames, and $ST_{e,j}$ is the sum of the $Te_j$ values. $Te_i$ denotes the execution time of each task that transmits and processes the $ICJ_{i,j}$ frame in the upward flow of ICJ frames, and $ST_{e,i}$ is the sum of the $Te_i$ values.

Fig. 5 shows the average execution time of different PHY and CHK tasks. Table 2 lists the average execution time of TCP/IP network protocol tasks. Note that the experimental outcomes are reported for a packet size of 1450 bytes. The software PHY-TX task requires the worst execution time since all packets from upper protocol layers tend to flow into the PHY-TX task. In particular, the CHK-TX and CHK-RX tasks in hardware outperform the corresponding CHK-TX and CHK-RX tasks in software. These two hardware tasks exploit parallelism to increase their execution speed and activate the buffer transfer mode supported by the AHB, which raises the data transfer rate to as much as 32 bytes per clock edge. In all hardware tasks connected to the AHB interface, all timing is referenced to a single clock edge that controls their operations. Thus, each hardware task yields better performance than the corresponding software task.

Fig. 6 shows the average packet processing delay from the application layer to the physical layer for three TCP/IP protocol implementations evaluated on the SoC platform: the well-known uIP for embedded networking systems (Dunkels, 2003); TCP/IP-ST (a TCP/IP protocol suite based only on Software network protocol Tasks), developed in this paper according to the specifications of Table 1; and TCP/IP-MT (a TCP/IP protocol suite based on Mixed software–hardware network protocol Tasks), developed in this paper according to the specifications of Tables 1 and 2. We do not consider TCP/IP-HT (a TCP/IP protocol suite based on Hardware network protocol Tasks) for two reasons: (1) the performance of TCP/IP-MT is very similar to that of TCP/IP-HT, and (2) one of the main issues in this paper is the partition problem of implementing network protocol functions as a set of hardware and software task units that run concurrently on an SoC platform. Each of the three TCP/IP implementations is evaluated from two points of view: (1) the downward flow of packets from higher layers to lower layers, the TX (transmission) point of view, and (2) the upward flow of packets from lower layers to higher layers, the RX (reception) point of view. In the case of uIP-TX, when the packet size exceeds 700 bytes, packet fragmentation takes place and splits the packet equally into two packets, which leads to the performance degradation shown in Fig. 6. Since TCP/IP-ST devotes a single software task to each protocol layer, the overheads of context switching and of internal packet transfer over the RT-IJC2 degrade its performance. TCP/IP-MT can run the network protocol tasks implemented in hardware in parallel and therefore handles a much higher packet rate than uIP and TCP/IP-ST, which cannot perform parallel operations.

Table 3 presents the complete set of application tasks used to measure the deadline miss ratio of application and network protocol tasks in Figs. 7–14. In Table 3, ρ denotes the ARM922T processor utilization factor occupied by application tasks. Fig. 7 shows the average deadline miss ratio of network protocol tasks in the TCP/IP-ST. Fig. 8 shows the average deadline miss ratio of network protocol tasks in the TCP/IP-MT. Tasks are scheduled by the T-RM, T-DM, and T-EDF schedulers. ICJ frames are scheduled by the ICJ-FIFO, ICJ-WFQ, ICJ-RM, ICJ-DM, ICJ-EDF, and ICJ-EDIT schedulers. In these experiments, the ICJ-EDIT scheduler achieves at least 13.7% and on average 57.9% performance improvement over all other ICJ schedulers (ICJ-WFQ, ICJ-FIFO, ICJ-RM, ICJ-DM, and ICJ-EDF) in terms of average deadline miss ratio. The ICJ-WFQ scheduler performs worse than the other ICJ schedulers, including the ICJ-FIFO scheduler, since it does not consider the priority and deadline of each ICJ frame. With the ICJ-RM scheduler, the deadlines of the AUDIO-TX and VIDEO-TX tasks can be missed frequently since their priority is lower than that of the EXP-TX task. The ICJ-EDF scheduler, which considers the deadline of each ICJ frame, ranks second in the performance measurement. As shown in Figs. 7 and 8, the co-synthesis of the TCP/IP-MT reduces the average deadline miss ratio by 30.7% compared with the software-only implementation, owing to less context switching, hardware parallelism, and the burst I/O transfer exploited in hardware tasks. In particular, tasks scheduled by the T-EDF scheduler combined with the ICJ-EDIT scheduler achieve better performance than with the other ICJ schedulers. Even as the processor utilization increases, the ICJ-EDIT scheduler still performs better than the other ICJ schedulers, since it continuously attempts to pass the earliest deadline of an ICJ frame on to the task that processes the frame while keeping the deadlines of the other ICJ frames.

Figs. 9 and 10 show the average deadline miss ratio of all network and application tasks running on the TCP/IP-ST and the TCP/IP-MT, respectively. With the ICJ-DM scheduler, the deadline of the VIDEO-TX task can be missed frequently since the priority of the AUDIO-TX task is higher than that of the VIDEO-TX task. Note that the DM (Deadline Monotonic) algorithm assigns priorities to tasks and ICJ frames according to their deadlines: the task and the ICJ frame with the shortest relative deadline are assigned the highest priority. As specified in Table 3, the deadline of the AUDIO-TX task is 25 ms and that of the VIDEO-TX task is 66 ms. The ICJ-EDIT scheduler performs a schedulability and acceptance test before it passes the earliest ICJ deadline on to the task that will process the ICJ frame. The ICJ-EDIT scheduler achieves at least 5.8% and on average 45.4% performance improvement over all other ICJ schedulers in terms of average deadline miss ratio. As shown in Fig. 10, the co-synthesis approach of the TCP/IP-MT reduces the average deadline miss ratio by 22.8% compared with the TCP/IP-ST.

Fig. 9. Average deadline miss ratio of all tasks running on TCP/IP-ST. (Plots of average deadline miss ratio (%) versus processor utilization by application tasks, from 0.4 to 0.9, for the ICJ-FIFO, ICJ-RM, ICJ-DM, ICJ-EDF, ICJ-WFQ, and ICJ-EDIT schedulers.)

Fig. 10. Average deadline miss ratio of all tasks running on TCP/IP-MT. (Plots of average deadline miss ratio (%) versus processor utilization by application tasks, from 0.4 to 0.9, for the ICJ-FIFO, ICJ-RM, ICJ-DM, ICJ-EDF, ICJ-WFQ, and ICJ-EDIT schedulers.)

Figs. 11 and 12 show the average deadline miss ratio of individual network protocol tasks in the TCP/IP-ST and the TCP/IP-MT, respectively, when ρ is equal to 0.9. Common to these experiments are bottlenecks at the TCP-TX task, which is the packet transmission entry point from application tasks, and at the PHY-RX task, which is the packet reception entry point from the LAN91C111 Ethernet chipset. The ICJ-EDIT scheduler achieves better performance than the other ICJ schedulers regardless of the type of task scheduler. If a set of tasks and ICJ frames is schedulable under the ICJ-EDIT scheduler, the earliest deadline among the ICJ frames is inherited by the task that handles the ICJ frame. The task can then preempt the software task currently running on the processor and execute as early as possible.
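The deadline inheritance performed by the ICJ-EDIT scheduler can be sketched in Python as follows. The text states only that a schedulability and acceptance test precedes the inheritance; the density test used here (the sum of execution time over relative deadline must not exceed 1) is a stand-in assumption, and the `Task` class, the function names, and the frame deadlines are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    name: str
    exec_ms: float
    deadline_ms: float  # relative deadline


def density(tasks):
    """Sum of execution time over relative deadline (assumed acceptance metric)."""
    return sum(t.exec_ms / t.deadline_ms for t in tasks)


def inherit_earliest_deadline(tasks, receiver, frame_deadlines_ms):
    """Let `receiver` inherit the earliest queued frame deadline if the set stays acceptable."""
    if not frame_deadlines_ms:
        return list(tasks)
    earliest = min(frame_deadlines_ms)
    candidate = [
        replace(t, deadline_ms=min(t.deadline_ms, earliest)) if t.name == receiver else t
        for t in tasks
    ]
    # Keep the original deadlines when the tightened set fails the test.
    return candidate if density(candidate) <= 1.0 else list(tasks)


tasks = [
    Task("TCP-TX", 0.155, 20.0),
    Task("AUDIO-TX", 2.5, 25.0),
    Task("VIDEO-TX", 6.6, 66.0),
]
accepted = inherit_earliest_deadline(tasks, "TCP-TX", [12.0, 5.0])
rejected = inherit_earliest_deadline(tasks, "TCP-TX", [0.1])
print(accepted[0].deadline_ms, rejected[0].deadline_ms)
```

In the accepted case the receiving task's deadline tightens to that of the most urgent frame, which is what allows it to preempt the running software task under an EDF task scheduler; in the rejected case the original deadline is kept so that the other tasks are not put at risk.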

Fig. 11. Average deadline miss ratio of individual network tasks in TCP/IP-ST where ρ = 0.9. (Panels: (a) task scheduling by DM; (b) task scheduling by EDF. Bars show the average deadline miss ratio (%) of the PHY-RX, PHY-TX, ARP-MIB, ARP-RR, IP-RX, IP-TX, CHK-RX, CHK-TX, TCP-RX, TCP-TX, TCP-RX-SESSION, and TCP-TX-SESSION tasks for the ICJ-FIFO, ICJ-RM, ICJ-DM, ICJ-EDF, ICJ-WFQ, and ICJ-EDIT schedulers.)

Fig. 12. Average deadline miss ratio of individual network tasks in TCP/IP-MT where ρ = 0.9. (Panels: (a) task scheduling by RM; (b) task scheduling by EDF. Bars show the average deadline miss ratio (%) of the same network protocol tasks for the ICJ-FIFO, ICJ-RM, ICJ-DM, ICJ-EDF, ICJ-WFQ, and ICJ-EDIT schedulers.)

Network protocol tasks in the TCP/IP-ST yield the following average deadline miss ratios for each ICJ scheduler: 6.7% for ICJ-FIFO, 7.4% for ICJ-WFQ, 5.7% for ICJ-RM, 5.6% for ICJ-DM, 5.4% for ICJ-EDF, and 2.1% for ICJ-EDIT. Network protocol tasks in the TCP/IP-MT yield the following average deadline miss ratios for each ICJ scheduler: 4.9% for ICJ-FIFO, 5.3% for ICJ-WFQ, 2.8% for ICJ-RM, 2.9% for ICJ-DM, 2.2% for ICJ-EDF, and 1.0% for ICJ-EDIT.

Fig. 13 shows the average response time of transmitting and receiving ICJ frames containing audio and video packets in the TCP/IP-MT. In this experiment, the T-EDF scheduler is used as the task scheduler and either the ICJ-EDF or the ICJ-EDIT scheduler is used as the ICJ scheduler. As in Fig. 11, both the TCP-TX task, which is the packet transmission entry point, and the PHY-RX task, which is the packet reception entry point, have the longest average response time for processing ICJ frames. The ICJ-EDIT scheduler yields a much shorter average response time than the ICJ-EDF scheduler. For example, the average response time of the TCP-TX task is 5166 µs under the ICJ-EDF scheduler but 1600 µs under the ICJ-EDIT scheduler. Overall, the ICJ-EDIT scheduler in the TCP/IP-MT reduces the average response time of processing ICJ frames by 11.09% compared with the ICJ-EDF scheduler. For the PHY-RX task, the ICJ-EDIT scheduler in the TCP/IP-MT reduces the average response time of processing ICJ frames by 54.41% compared with the ICJ-EDF scheduler. The comparison illustrated in Fig. 13 empirically supports the theoretical outcome presented in Section 4.

Fig. 13. Average response time of ICJ frames in TCP/IP-MT. (Panels: (a) average response time of transmitting audio/video ICJ frames by the T-EDF and ICJ-EDF schedulers; (b) average response time of transmitting audio/video ICJ frames by the T-EDF and ICJ-EDIT schedulers; (c) average response time of receiving audio/video ICJ frames by the T-EDF and ICJ-EDF schedulers; (d) average response time of receiving audio/video ICJ frames by the T-EDF and ICJ-EDIT schedulers. Each panel plots the response time of ICJ frames (µs) per task, PHY-TX, ARP-MIB, CHK-TX, IP-TX, and TCP-TX for transmission and TCP-RX, CHK-RX, IP-RX, and PHY-RX for reception, at processor utilizations from 0.4 to 1.)

Fig. 14 shows the average number of context switches of TCP/IP protocol tasks as the processor utilization offered by application tasks varies. Since the TCP/IP-MT can run network protocol tasks in hardware in parallel without any context switching, it reduces the average number of context switches by 70 compared with the TCP/IP-ST. When ρ reaches 1, both the TCP/IP-MT and the TCP/IP-ST yield the same average number of context switches because the ARM922T processor is already overloaded with the execution of software-based application tasks.

Fig. 14. Average number of context switching of TCP/IP protocol tasks over varying processor utilization offered by application tasks. (Plot of the average number of context switches per second versus processor utilization by application tasks, from 0.4 to 1, for TCP/IP-ST and TCP/IP-MT.)

6. Conclusion

The main issue addressed in this paper is the hardware–software co-design of network protocols on an SoC platform that takes into account the application-specific real-time requirements of multimedia and networking applications. We address this issue as a partition problem of implementing network protocol functions in dynamically reconfigurable hardware and software tasks. After applying the ITC technique incorporating the RT-IJC2 to the SoC platform with the full specifications of the TCP/IP protocol suite, we verify and validate the performance of the SoC platform. Experimental results indicate that the proposed technique efficiently supports high performance in terms of minimal average processor utilization and a minimal number of context switches, and meets application-specific real-time and QoS constraints in terms of a minimal deadline miss ratio.

Acknowledgement

This work has been supported by research fund from the Korean Land Spatialization Group hosted by the Ministry of Construction and Transportation in Korea.

References

Altera, 2007. Altera Excalibur EPXA4 device.
Altera, 2008. Quartus development software.
ARM, 2001. ARM Developer Suite (ADS) 1.2.
Banerjee, S., Bozorgzadeh, E., Dutt, N.D., 2006. Integrating physical constraints in HW–SW partitioning for architectures with partial dynamic reconfiguration. IEEE Transactions on Very Large Scale Integration Systems 14 (11), 1189–1202.
Bolotin, E., Cidon, I., Ginosar, R., Kolodny, A., 2004. QNoC: QoS architecture and design process for network on chip. Journal of Systems Architecture 50 (2–3), 105–128.
Cottet, F., Delacroix, J., Kaiser, C., Mammeri, Z., 2002. Scheduling in Real-time Systems. John Wiley.
Dollas, A., Ermis, I., Koidis, I., Zisis, I., Kachris, C., 2005. An open TCP/IP core for reconfigurable logic. In: Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Washington, DC, USA, pp. 297–298.
Dunkels, A., 2003. Full TCP/IP for 8-bit architectures. In: Proceedings of the 1st International Conference on Mobile Systems, Applications and Services, New York, NY, USA, pp. 85–98.
Gupta, R.K., Coelho, C.N., Micheli, G.D., 1994. Program implementation schemes for hardware–software systems. IEEE Computer 27 (1), 48–55.
Gauthier, L., Yoo, S., Jerraya, A.A., 2001. Automatic generation and targeting of application specific operating systems and embedded systems software. In: Proceedings of the Conference on Design, Automation and Test in Europe, Munich, Germany, pp. 679–685.
ISO/IEC 14496-2:2001, 2001. Coding of audio–visual objects – Part 2: Visual, second ed.
Labrosse, J.J., 2002. MicroC/OS-II: The Real-Time Kernel, second ed. CMP Media.
Lagnese, E.D., Thomas, D.E., 1991. Architectural partitioning for system level synthesis of integrated circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 10 (7), 847–860.
Lahiri, K., Raghunathan, A., Lakshminarayana, G., Dey, S., 2004. Design of high-performance system-on-chips using communication architecture tuners. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23 (5), 620–636.
Lee, D., Kim, Y., Tak, S., 2007. A study on SoC platform design supporting dynamic cooperation between hardware and software modules. MultiMedia Society 10 (11), 1446–1459.
Lee, J., Mooney, V.J., Daleby, A., Ingstrom, K., Klevin, T., Lindh, L., 2003. A comparison of the RTU hardware RTOS with a hardware/software RTOS. In: Proceedings of Asia and South Pacific Design Automation Conference, Kitakyushu, Japan, pp. 683–688.
Lofgren, A., Lodesten, L., Sjoholm, S., 2005. An analysis of FPGA-based UDP/IP stack parallelism for embedded Ethernet connectivity. In: Proceedings of Norchip Conference, Oulu, Finland, pp. 94–97.
Mogul, J.C., 2003. TCP offload is a dumb idea whose time has come. In: Proceedings of the 9th Conference on Hot Topics in Operating Systems, Berkeley, CA, USA, vol. 9, pp. 5–10.
Mooney, V.J., Blough, D.M., 2002. A hardware–software real-time operating system framework for SoCs. IEEE Design and Test of Computers 19 (6), 44–51.
Nahrstedt, K., Steinmetz, R., 1995. Resource management in networked multimedia systems. IEEE Computer 28 (5), 52–63.
Nakano, T., Komatsudaira, Y., Shiomi, A., Imai, M., 1999. Performance evaluation of STRON: a hardware implementation of a real-time OS. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E82-A, 2375–2382.
Olson, J.T., Rozenblit, J.W., Talarico, C., Jacak, W., 2007. Hardware/software partitioning using Bayesian belief networks. IEEE Transactions on Systems, Man and Cybernetics, Part A 37 (5), 655–668.
Panic, G., Dietterle, D., Stamenkovic, Z., Tittelbach-Helmrich, K., 2003. A system-on-chip implementation of the IEEE 802.11a MAC layer. In: Proceedings of the Euromicro Symposium on Digital Systems Design, Washington, DC, USA, pp. 319.
Paulin, P.G., Pilkington, C., Langevin, M., Bensoudane, E., Lyonnard, D., Benny, O., Lavigueur, B., Lo, D., Beltrame, G., Gagne, V., Nicolescu, G., 2006. Parallel programming models for a multiprocessor SoC platform applied to networking and multimedia. IEEE Transactions on Very Large Scale Integration Systems 14 (7), 667–680.
Regnier, G., 2004. TCP onloading for data center servers. IEEE Computer 37 (11), 48–58.
SMSC Corporation, 2008. SMSC LAN91C111.

Youngmann Kim is a doctoral student in the School of Computer Science and Engineering at Pusan National University. His research interests include real-time embedded systems, wireless networking, and SoC (System on Chip)-based network processor design.

E.K. Park is a Professor of Computer Science at the University of Missouri at Kansas City. He received a Ph.D. degree in Computer Science from Northwestern University. His research interests include software engineering, software architectures, software agents, distributed systems, object-oriented methodology, software tolerance and reliability, computer networks and management, optical networks, database/data mining, numerical computing, optimizations, and information/knowledge management. Currently, he is on an assignment serving as a Program Director, Division of Computing and Communications Foundations at the US National Science Foundation.

Sungwoo Tak is an associate professor in the School of Computer Science and Engineering at Pusan National University. He is also a research member at the Research Institute of Computer Information and Communication at Pusan National University. He received a Ph.D. degree in Computer Science from the University of Missouri – Kansas City. His research interests include computer networks, wireless networks, software architecture, WDM optical networks, real-time systems, game theory, and SoC (System on Chip)-based network processor design. He is the corresponding author of this paper.