Design and implementation of reconfigurable and flexible test access mechanism for system-on-chip

Design and implementation of reconfigurable and flexible test access mechanism for system-on-chip

ARTICLE IN PRESS INTEGRATION, the VLSI journal 40 (2007) 149–160 www.elsevier.com/locate/vlsi Design and implementation of reconfigurable and flexible...

260KB Sizes 0 Downloads 26 Views

ARTICLE IN PRESS

INTEGRATION, the VLSI journal 40 (2007) 149–160 www.elsevier.com/locate/vlsi

Design and implementation of reconfigurable and flexible test access mechanism for system-on-chip Zahra S. Ebadia, Alireza N. Avanakib, Resve Saleha,, Andre Ivanova a

Department of Electrical and Computer Engineering, University of British Columbia, 2356 Main Mall, Vancouver, BC, Canada V6T 1Z4 b Department of Electrical and Computer Engineering, University of Tehran, P.O. Box 14395-515, Tehran, Iran

Abstract One of the difficult problems facing core-based system-on-chip (SoC) designs is test access. For testing the cores in an SoC, a special mechanism is required since they are not directly accessible via chip inputs and outputs. In this paper, we introduce a novel test access mechanism (TAM) based on time-division multiplexing (TDM). This so-called TDM-TAM is P1500 compatible and, in fact, uses a P1500 wrapper. The TAM advantages are its flexibility, scalability, and reconfigurability. Also this TAM could be very useful for testing multi-frequency cores in SoCs because, with TDM, the test data rate can be changed. The proposed TAM is compared with two other approaches: a serial threading approach analogous to the IEEE1149.1 standard (Serial TAM) and a packet-switching test network (called NIMA). A network processing unit is used as an SoC platform to compare the different TAMs. Results show that, in most cases, TDM is the most effective TAM in both test time and overhead area. r 2006 Elsevier B.V. All rights reserved. Keywords: SoC testing; Embedded core testing; Test access mechanism; Time-division multiplexed TAM; Optimal test time

1. Introduction Reusable embedded cores are now being employed in large system-on-chip (SoC) designs. They facilitate design and lead to shorter product development cycles. However, the manufacturing tests and debug of SoCs are major challenges. Since cores in an SoC are not directly accessible via chip inputs and outputs, special access mechanisms are required to test them at the system level. Zorian et al. [1] proposed a generic conceptual test access architecture for embedded cores, with the following components: source, sink, and test access mechanism (TAM) (Fig. 1). The TAM is used to deliver test stimuli from the source (which generates test stimuli) to the cores, and also to deliver responses from cores to the sink (which evaluates test responses). There are a number of proposed TAM architectures in the literature. These architectures can be broadly classified into four groups: (1) multiplexing, (2) serial connection, (3) test network-on-chip (NoC) and (4) bus-based connection. Corresponding author. Tel.: +1 604 822 3702; fax: +1 604 822 9506.

E-mail address: [email protected] (R. Saleh). 0167-9260/$ - see front matter r 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2006.02.002

In the first category, multiplexing is used to access the cores. The simplest method in this category directly multiplexes the test pins to the primary inputs/outputs (I/O) [2]. Another method modifies the cores such that each core has a transparent mode for testing [3]. There are several problems with the multiplexing TAM methods that limit the scope of use for future complex SoCs, such as large overhead area for decoding and multiplexing, long test times, and non-scalability of the architecture. TAMs in the serial connection category [4–6] use the established IEEE 1149.1 standard. For a few cores on an SoC, it may be possible to spend time transporting the test vectors serially to the cores. However, as the number and complexity of the cores increases, a serial solution based on the IEEE 1149.1 standard or its variants will prove too costly in terms of test time. A more recent approach is to use a test NoC for delivery and collection of test data. A Networked Indirect and Modular Architecture (NIMA) for TAM was proposed in [7], where emphasis is placed on modularity, generality, and configurability of the architecture to exploit the advantages offered by the reuse paradigm.

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

150

ROM

PCI DRAM

UART

Source

TAM

UDL

Core Under Test Wrapper

MPEG

TAM

Sink

CPU

SoC

Source: Zorian 99

Fig. 1. Conceptual TAM architecture.

A number of different variations of the bus-based connection schemes for TAM have been proposed. For example, [8–10] represent this type of TAM connection. In terms of trading-off increased overhead area for reduced test access time, bus-based architectures are the most efficient TAM schemes suggested to date. In this paper, we present a novel bus-based TAM, called TDM-TAM, which not only has all the advantages of common bus-based TAMs such as scalability, efficiency in time and area, but it is also flexible, reconfigurable, and it handles cores with built-in self-test (BIST) efficiently and needs less global routing than common bus-based approaches. Using time-division multiplexing (TDM), our proposed TAM efficiently employs test buses and achieves a smaller test time. Also, we show that using TDM-TAM gives flexibility to the tester to change the test scheduling and strategy after fabrication. In addition, TDM-TAM can be very useful for testing multi-frequency cores in SoCs, for which it is desirable to test each core at its own speed. Current bus-based TAMs send test data for all cores at a single rate, so for testing multi-frequency SoCs, a complicated automatic test equipment (ATE) set up is required. Apparently, a TAM capable of testing cores at different rates can save a lot in test cost. In Section 2.4 this TDM-TAM advantage is addressed in detail. The proposed TAM is compared with two other TAM architectures for SoC design. The remainder of this paper is organized as follows: in Section 2, the basic concept of TDM-TAM is discussed and test time and area models are derived. In Section 3, TAM optimization for TDM-TAM are reported. In Section 4, the experiments and the results for a platform SoC are described. Section 5 concludes the paper with some conclusions and a few directions for future work. 2. Time-division multiplexed TAM

Our TDM-TAM is a bus-based TAM that uses logic situated locally at each core to enable or disable the core, such that bus contention cannot occur. This architecture eliminates the necessity for arbitration and global address lines, which are normally required in a bus-based architecture. In our TDM-TAM, the data to be sent on the TAM bus is divided into frames. Each core is assigned a specific mask enabling the cores to extract the appropriate data bits from its associated frames. For example, in Fig. 2, a frame is assumed to consist of four bits. Core A uses the first bit of each frame, core B uses bits 2 and 4, and core C uses bit 3. For this specific configuration, assume that we wish to send the data ‘‘11’’ to core A, ‘‘00’’ to core B and ‘‘don’t care’’ bits ‘‘XX’’ to core C. The two required frames to be sent on the TAM bus in this case must, therefore, be ‘‘10X0’’, followed by ‘‘1XXX’’. The corresponding masks for each core are also shown in Fig. 2. Optimal frame and mask assignment is critical for the efficiency of the scheme. This topic is addressed later in this paper. In our case, to implement the TDM-TAM, a P1500 wrapper is used to wrap each core. P1500 for embedded core test [12] is an IEEE standard under development consisting of two components, a Core Test Language and a Core Test Wrapper. The P1500 wrapper of Fig. 3 includes (1) a wrapper instruction register (WIR) for loading wrapper instructions to control the wrapper operation, (2) wrapper boundary cells used at each terminal, (3) a bypass register for the serial interface layer (SIL), and (4) a wrapper interface port (WIP) for controlling wrapper registers via SIL. WIP signals include four wrapper control signals (captureWIR, shiftWIR, UpdateWIR and selectWIR), as well as the wrapper clock (WRCK) and wrapper reset (WRSTN) signals. The P1500 wrapper connects to one mandatory one-bit wide TAM, i.e., the wrapper serial input/output (WSI/ WSO), and zero or more scalable-width TAMs (TAM-in/ TAM-out). A minimal compliant implementation has only the single-bit TAM along with both test control values for the WIR. Envisaged typical usage has one multi-bit TAM next to the mandatory TAM. In this case the bulk test data Frame 1 A

B

C

Frame 2 B

A

B

C

Frame 3 B

A

B

Core A mask:1000

C

B

Core B mask:0101

Core C mask:0010

2.1. Overview TDM is a technique used for transmission of several pieces of data (either digital or analog signals) over a communication channel by dividing the time frame into slots, one slot for each piece of data [11]. We use this approach to send/collect test data across a test bus.

A

B

C

B

Frame Fig. 2. Example illustrating the time-division multiplexing (TDM) TAM concept.

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

access is performed along the multi-bit TAM, while the single-bit TAM is used to program the WIR and possibly transport test data in a silicon debug scenario. TAM-in and TAM-out need not to be the same width. In the standard 1149.1 test access port (TAP) controller, a 16-bit finite state machine (FSM) generates the wrapper control signals from the serial test mode select (TMS) bit stream. For our TDM-TAM, we transformed the original 1149.1 TAP controller into a TDM-TAP controller as illustrated in Fig. 4. This controller includes the original 1149.1 FSM as well as some minimal extra logic, referred to as a time-division multiplexer (TDM) block, which creates the Enable signal for the core according to a preassigned mask. The TDM block operates as follows: when the Reset signal goes low, the mask is loaded into the flip flop chain. When the Reset is high, the mask is shifted by one bit on every falling edge of the clock. Using the falling edge allows the Enable signal to be stable before the rising edge of the clock, thus avoiding glitches when used to gate the clock. The Enable signal generated for each core is used to enable three different components. When a core’s Enable signal is active (high), the TMS control signal is read by the core’s

P1500 Wrapper

WBR

Inputs

Wrapper Boundary Cell

Outputs

Core TAM-in

WIR

WSI

In this section, a model for test time with the TDMTAM is developed. A core’s test time T (in clock cycles) is determined by the scan-in cycles, si , the scan-out cycles, so , and the number of test patterns, N P , according to the following [13]:

WSO

Wrapper Control signals

4 WRCK

WRSTN

TAP controller and the core’s P1500 wrapper. The core is then clocked, allowing it to read data from the WSI and TAM-in. Moreover, the tristate buffer is enabled, thereby allowing the core to write to the WSO and TAM-out. When Enable is deactivated (low), all these components are deactivated, allowing other cores on the same TAM bus to use the bus. Cores tested by a full BIST scheme may not be disabled, so that their test can be executed as quickly as possible, although their TAP controller, P1500, and tristate buffer may still be deactivated. Test power considerations may, however, require that BISTed cores be disabled at times as well. Fig. 5 depicts an SoC containing a number of cores with P1500 wrappers and the TDM-TAP controllers just described. It illustrates what we refer to as a single-branch TDM-TAM where a branch includes the one-bit wide TMS, a one- or more-bit wide TAM-in bus (to accommodate at least the WSI signal) and a one- or more-bit wide TAM-out bus. Hence, the minimum data line width of a branch is three. Each branch also includes global Clock and Reset signals. All the cores connected to a branch have a local TDM-TAP controller and associated logic, and a tristate buffer. They share TMS, Clock, Reset, TAM-in, and TAM-out lines. Different SoCs can have different numbers of TDMTAM branches. A case where all the cores connect to the same branch constitutes the simplest, single-branch case of Fig. 5. The other extreme is to have as many branches as there are cores on the SoC, with each core connected to its own private branch. 2.2. Timing model

TAM-out

Scan chain 0 Scan chain 1

151

16 State TAP FSM

T ¼ ð1 þ maxðsi ; so ÞÞN P þ minðsi ; so Þ. TCK

TRSTN

TMS

Fig. 3. Conceptual view of the IEEE P1500 wrapper and TAP controller.

This equation assumes that a scan-out operation is done concurrently with the scanning in of the subsequent test Reset

TMS

Reset

mask[0] mask[N-2] mask[N-1]

Mask

Clock

Clock

N

Enable

(1)

Time Division Multiplexer Block

Enable

Ctag TAP Controller

1 2 3 4 Wrapper Control Signals

Fig. 4. Block diagram of TDM-TAM TAP controller for P1500 wrapper.

1: CaptureWIR 2: ShiftWIR 3:UpdateWIR 4:SelectWIR

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

152

TAM-in

TAM-out

TAM-in

Core

WSI

Core

WSI WSO

WCK

TAM-out

WSO

WCK

P1500

Enable

P1500 Enable

WRSTN

TDM TAP Controller

WRSTN

TDM TAP Controller

Clock Reset TMS

Fig. 5. Single-branch TDM-TAM.

pattern, and that the application of one scan test pattern requires one system clock cycle. Definition. For an IP core i, MW i ðtÞ is the minimum test bus width (in bits) for which the IP core i test time will be less than or equal to t. For a core with N I inputs, N O outputs, N S scan chains of length l 1 ; l 2 ; . . . ; l N S respectively, and N P test patterns, an upper bound on the core test time, T, can be derived to produce: U T ¼ ð1 þ L þ maxðN I ; N O ÞÞN P þ L þ minðN I ; N O Þ, (2) P S where L ¼ N i¼1 l i . That is because the longest possible test time is achieved when a one-bit wide bus is used as a TAM for the core. In this case all the core’s scan chains and its inputs and outputs must be addressed through the singlebit TAM. Hence, si ¼ L þ N I and so ¼ L þ N O , and maxðsi ; so Þ ¼ L þ maxðN I ; N O Þ and minðsi ; so Þ ¼ Lþ minðN I ; N O Þ. Substituting the latter in Eq. (1), we reach the above expression for U T . Also for the same scenario, the lower bound on the core test time, T, is given by LT ¼ N P l max þ N P þ l max . (3) That is because the shortest test time is achieved when the longest internal scan chain is on a private TAM chain making si ¼ so ¼ l max . Hence si and so are not less than l max . Substituting the latter in Eq. (1), we reach the above expression for LT . Note that the minimum width required to achieve this lower bound is when si ¼ so ¼ l max for all of the TAM chains. Since we know that the si XðL þ N I Þ=W and so XðL þ N O Þ=W , therefore L þ NI ; si ¼ l max ¼) l max si X W L þ NI L þ NI ¼) W X X , W l max

L þ NO ; so ¼ l max ¼) l max W L þ NO L þ NO ¼) W X X . W l max

so X

Hence, considering that TAM width can only be integer values, we obtain   L þ maxðN I ; N O Þ W min ¼ , l max P S where l max ¼ maxi¼1;...; N S l i , L ¼ N i¼1 l i , and d:e denotes rounding up to the closest integer. To refine the test time model, we consider the time required to send instructions to the wrapper, and to put the wrapper in test mode, as well as the time required for loading/capturing a signature (if required). We assume that the cores are tested using either test patterns provided externally via the TAM, or by using patterns generated and evaluated within the core, that is, using a form of BIST. For the BISTed cores, the TAM is only assumed to send a ‘‘start BIST’’ instruction, and to subsequently communicate a BIST result, such as a signature. The cores that are not BISTed require not only test instructions but also test pattern data to be sent via the TAM. Based on the latter observations, we define two parameters for each core: tI and tD . Here, tI is the portion of the core test time during which the core does not need to control or use the TAM, and is therefore independent of the mask assignment. Specifically, the time a BISTed core requires for the BIST test patterns to be generated, applied and evaluated, is part of tI . We define tD as the part of a core’s test time that requires the control/use of the TAM and is, therefore, dependent on the mask assignment. That is, the time required to send a specific test instruction to a core is part of tD .

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

Let the SoC consist of N C cores and N B branches, and assume that core j, 1pjpN C is assigned to branch k, 1pkpN B . The tI portion of the test time of core j, tIj , does not depend on the mask assignment. Hence, the core–branch connections do not affect this part of the core test time. On the other hand, the tD portion of the core test times depend on the cores’ mask assignments and frame lengths, i.e., tDj depends on mask assignments and frame lengths. From the example in Fig. 2, we can see that as the number of ‘‘ones’’ in a core’s mask increases, the proportion of data from each frame used by the corresponding core increases accordingly. Hence, the test time for core j when assigned to branch k is   tDj T jk ¼ (4) M k þ tI j , kmaskj k where M k is the number of bits in a frame for branch k and kmaskj k is the number of ones in the mask assigned to core j. To illustrate, reconsider the example in Fig. 2 and assume that core B is not BISTed and that the TAM-independent test time is 20 cycles, while tDB ¼ 15. For Fig. 2, M k for the bus is 4, kmaskA k ¼ kmaskC k ¼ 1 and kmaskB k ¼ 2. Core B can use two bits of every 4-bit frames. Therefore, testing core B requires d15 2 e ¼ 8 time frames. In turn, this implies a total of 8  4 clock cycles. Let X ij be a 0–1 variable defined as follows:  1 core j is assigned to bus k; X jk ¼ 0 otherwise: Using Eq. (4), the test time for each core can be determined such that total test time for testing all cores assigned to bus C k amounts to T k ¼ maxN j¼1 ðX jk T jk Þ since all the cores can be tested concurrently due to TDM. Assuming that all the TAM branches can be exercised simultaneously, then the B total test time amounts to TT ¼ maxN j¼1 ðT k Þ or, more precisely:   NC NB TT ¼ max max ðX jk T jk Þ . (5) j¼1

j¼1

For a SoC using TDM-TAM and with N C cores, an upper bound on the total test time is given by NC

UPTT ¼ max U i M, i¼1

where U i is the upper bound of the test time for core i given by Eq. (2). That is because the single branch with minimum width TDM-TAM is the worst case for test time of a SoC. To derive an upper bound on the test time, assume that all cores of the SoC lie on one branch having a one-bit wire width for TAM-in/TAM-out. This implies that the test time of each core equals the corresponding upper bound for test time. Assuming that the worst case for test time is C maxN i¼1 U i , and that such a core is not BISTed, then tI of C this core is zero and tD ¼ maxN i¼1 U i . Furthermore, in the worst case, such a core has a minimum share of the bus. That is, it is assigned only one bit in each frame. Under

153

these conditions, the total SoC test time reaches its upper C bound: TT ¼ dt1D eM þ tI ¼ maxN i¼1 U i M. For a SoC using TDM-TAM with N C cores, a lower bound on the total test time is given by NC

LOTT ¼ max Li , i¼1

(6)

where Li is the lower bound on the core i test time given by Eq. (3). That is because the minimum test time is attained when each core has its own branch (N B ¼ N C ) wide enough to test each core at its lower bound test time, so the test time of each core should be the minimum possible test time, Li (given by Eq. (3)). Hence, the total test time, based C on Eq. (5), is maxN i¼1 Li . Also, in this configuration each branch, i, width can be calculated from the minimum width required to achieve the core test time equal to or less than the minimum total test time of the SoC, MW i ðLOTT Þ. Therefore, the TDM-TAM width required for a minimum total test time is W LO ¼

i¼N XC

MW ðLOTT Þ.

(7)

i¼1

2.3. Overhead area model Modeling area for the TDM-TAM is straight-forward since the same additional logic is required for each core. In our area model, we neglect wiring area. The overhead for each core is due to the TDM-TAP controller and a tristate buffer. Hence, for an SoC with N C cores, the total area overhead ATDMTAM ¼ N C ðAbuffers þ ATDMTAP controller Þ þ Awires , where Abuffers and ATDMTAP controller correspond to the area of the buffers and TDM-TAP controller, respectively, and Awires is the area of wires used for TDM-TAM. The number of buffers for each core is equal to the number of TAM-out pins of core, so Abuffers is equal to the number of TAM-out pins (N TAMout ) times the area of a tristate buffer. The TDM-TAP controller, illustrated in Fig. 4, includes a CTAG TAP controller block and a time-division multiplexer (TDM) block. Therefore, the overhead area is ATDMTAM ¼ N C ðAbuffers þ ATDMTAP Þ þ Awires ¼ N C ðAbuffers þ ACTAGTAP þ ATDM block Þ þ Awires . In the TDM block, there are M flip flops and M 2:1 multiplexers. Therefore, ATDM block ¼ MðAFF þ A2:1MUX Þ, where AFF and A2:1MUX is the area of the flip flop and multiplexer, respectively. From the above, the area overhead for the scheme can be expressed as ATDMTAM ¼ N C ðN TAMout Atribuffer þ ACTAGTAP þ MðAFF þ A2:1MUX ÞÞ þ Awires .

ð8Þ

Estimates of actual area for the constituent blocks are given in Table 1 (for a 0:18 mm CMOS technology). In Eq. (8), the area overhead is proportional to the number of

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

154

Table 1 Area estimates for TDM-TAM controller block Circuit

Area ðmm2 Þ

Description

Buffer CTAG-TAP FF & MUX

37 1142 123

Tristate buffer CTAG TAP controller Flip Flop & 2 : 1 Multiplexer

1G 1G 1G 1G

4G

Core A freq=1 GHz

Core B freq=2 GHz

A

cores in the SoC and to the length of the data frames, M (also corresponding to the number of bits required to encode the mask associated with each core). As the frame length increases, the area overhead increases. However, test time will generally decrease with increased frame length, thereby resulting in a usual tradeoff between area and test time. The ability to program the masks at arbitrary times after fabrication is important. This feature can be implemented by adding a new state in the TDM TAP controller to control the shifting-in of a mask to the flip flop chain of the TDM block in Fig. 4. The advantages of this feature are flexibility and more effective post-fabrication (perhaps after an initial test phase) test resource optimization. Note that, the only additional logic required in the TDM block is a multiplexer which should be controlled by the TAP controller to select between the WSI (for shifting in a new mask) and the feedback signal (for normal operation).

2.4. Operation with multi-frequency SoC Today’s SoC could have cores with different frequencies. None of the existing TAMs are suitable for multifrequency SoC testing. The proposed TDM-TAM can handle the multi-frequency SoC, because with the TDM technique, we can manipulate frequency. Consider example in Fig. 2: data fed to core A and core C is a one-fourth of the frequency of the branch. Core B is fed by a half of the frequency of the branch. Also, with n-lines and multiplexers, we can have a line with frequency n times of the frequency of a single line. Fig. 6 shows an SoC with 3 cores, working at two different frequencies. Assume that the ATE can send data at 1 GHz. In Fig. 6, the one-branch configuration that can test each core at its own speed is shown.1 The mask of core A, B and C should be ‘‘1000’’, ‘‘0101’’ and ‘‘0010’’, respectively. This way, if the frequency of the branch is F, the frequency of buses going to cores A, B and C are F4 , F2 and F4 , respectively. Based on this formulation, F should be 4 GHz, but the maximum frequency of the ATE machine is 1 GHz. Using 4 lines and a multiplexer, we can generate a 4 GHz line. With this configuration, we can test each core at its own speed.

1

Note that Fig. 6 does not show TDM-TAM in full detail.

B

Core C freq=1 GHz

C

B

Frame

Fig. 6. Using TDM-TAM for multi-frequency SoC.

3. Optimization In Section 2, we described the basic concepts of TDMTAM, as well as a test time model, and a model for area overhead. One of the most important issues associated with TAM architectures is SoC test time minimization. This section focuses on this issue, and assumes that core designs and their test requirements are fixed. Specifically, we focus on the optimal assignment of cores to test buses and the optimal assignment of masks to individual cores, assuming the TDM-TAM architecture is in place. We address the following problems. Mask assignment: Assuming an SoC using the TDM-TAM scheme and N C cores is assigned to N B branches, determine the optimal mask assignment for each core. Core–branch pairings: Assume an SoC using the TDM-TAM scheme and a total of N C cores is assigned to N B branches; determine the optimal core–branch pairing for each core. The goal in both cases is minimization of test time. These problems are in fact revised from earlier TAM optimization work [14,15], where the objective is to find the optimum configuration for a specific TAM [8], to achieve a minimum test time. In our case (TDM-TAM), the problem is (1) finding the best assignment of cores to buses, and (2) finding the best mask assignments for each core. We use a genetic algorithm (GA) for optimization [15]. Our program requires the following information as input: number of cores, number of branches, the test strategy for each core, and whether any functional (non-scan) test patterns need to be applied, the number and length of the core scan chains, and the number of core input/outputs. Our program outputs an optimal branch configuration and core mask assignments. Example 1. As an example, we consider the U226 SoC benchmark from the ITC’02 SoC test benchmarks [16]. The characteristics of this benchmark are reported in Table 2. We assume a TDM-TAM design for this SoC and the application of our optimization program to find the optimal parameters for the TDM-TAM. We assume two branches with minimum width, i.e., N B ¼ 2, and the number of bits/frame to be 16 in both cases, i.e., M 1 ¼ M 2 ¼ 16. As five of the nine constituent cores are

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

155

Table 2 U226 from ITC SoC’02 benchmarks Core

No. of primary I/Os

No. of test patterns

No. of scan chains

Scan chain lengths

TAM use

1, 2, 3 4, 5, 6 7 8 9

2/1 3/17 97/64 34/32 17/10

1,363,968 2666 76 1,048,576 15

0 0 20 0 0

– – 52 – –

n y y n y

TAM-in 1

TAM-in 1 core 4

core 5

core 9

mask(7)

mask(7)

mask(2)

core 4

core 5

mask(8)

mask(8)

TAM-out 1

TAM-out 1 TAM-in 2

TAM-in 2 core 6

core 7

core 6

core 7

core 9

mask(6)

mask(10)

mask(6)

mask(9)

mask(1)

TAM-out 2

(a)

TAM-out 2

(b)

Fig. 7. Optimal configurations for SoC U266 assuming two branches: (a) total effective TDM-TAM width ¼ 2 and (b) total effective TDM-TAM width ¼ 3.

2 This is the total number of input/output TAM lines excluding the necessary control lines like TMS, Clock and Reset.

4

15

Total Test Time

assumed to be connected to the TAM for this benchmark, the problem is optimally assigning these five cores to two branches and making optimal mask assignments for each of the cores. Applying our optimization program yields the assignment of cores 4, 5 and 9 to a first branch, and cores 6 and 7 to the second, as illustrated in Fig. 7(a). For the first branch, the optimal mask assignment is such that kmask4 k ¼ kmask5 k ¼ 7 bits, and kmask9 k ¼ 2 bits, while those for the second branch are such that kmask6 k ¼ 6 bits and kmask7 k ¼ 10 bits. This configuration and assignment yields a total test time of 140,160 cycles. This test time is for the case where the branches are both of minimal width (i.e., each TAM branch consists of only one available data line) and hence the total effective TDM-TAM width2 is 2. However, when the total effective TDM-TAM width is 3 the optimal configuration differs (Fig. 7(b)). In this case, branch one is attributed an effective width of one bit (minimal width), with cores 4 and 5 are assigned to it and such that kmask4 k ¼ kmask5 k ¼ 8; while branch two is attributed an effective width of two bits, with cores 6, 7 and 9 are assigned to it and such that kmask6 ¼ 6k, kmask7 k ¼ 10 and kmask9 k ¼ 2. The total test time for this configuration amounts to 95,984 cycles. In Fig. 8, the total test times (cycles) versus total effective TDM-TAM widths, assuming two-branch configurations, are reported for benchmark U226.

Test Time vs. Width for U226

x 10

10

5

0

2

3

4

5

6

7

8

Width (bits)

Fig. 8. Benchmark U226: Test time (cycles) versus total effective TDMTAM width for two-branch configurations.

Example 2. In this example we consider the D695 SoC benchmark from ITC’02 SoC test benchmarks [16]. The test data for cores in the D695 is shown in Table 3. In the first step of TAM design, we should apply available wrapper design algorithms [17,18], and for each core calculate the test time for different widths. The result of wrapper design algorithm for the cores in D695 is reported in Table 4. Note that cores 1 and 2 are not scan-testable

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

156 Table 3 Test data for the cores in D695 Circuit (core)

No. of primary I/Os

c6288 c7552 s838 s9234 s38,584 s13,207 s15,850 s5378 s35,932 s38,417

No. of test patterns

64 315 35 75 342 214 227 84 355 134

No. of scan chains

12 73 75 105 110 234 95 97 12 68

Scan chain length

– – 1 4 32 16 16 4 32 32

Min

Max

– – 32 52 44 39 33 44 54 51

– – 32 54 45 41 34 46 54 55

Table 4 Test data for the cores in D695 Width ðW Þ

1 2 3 4 5 6 7 8 9 10 11 12 13 14–15 16 17 18 19 20 21–24 25–31 32 33 34 35 36 37 38 39

Cores test time (cycles) 3

4

5

6

7

8

9

10

5058 2658 2507

26,602 13,354 11,129 6782 5829

191,874 95,992 61,870 48,106 38,478 32,164 27,613 24,163 21,518 19,646 17,624 16,205 14,983 14,762 12,192 11,420 10,869 10,319 9989 9989 9878 6206 5985 5765 5655 5435 5325 5215 5105

185,489 105,798 62,246 73,400 37,338 31,240 27,964 54,610 20,906 19,034 19,034 18,799 18,799 18,564 11,978 11,273 10,571 10,103 9869

65,381 29,393 21,926 16,477 13,224 11,027 9966 8341 7383 6716 6431 6431 6431 6431 4219 4026 3739 3549 3454 3359

22,327 11,243 8691 5769 4605

26,351 13,182 8802 6597 5310 4440 3798 3305 2964 2820 2418 2226 2118 2118 1659 1572 1488 1416 1416 1416 1416 836 822 798 774 750 738 714

22,580 11,296 6934 5660 4653 3990 3327 2836 2664 2664 2073 2001 2001 2001 1374 1338 1338 1338 1338 1338 1338 763 739 727

cores, so the test time for them is equal to the number of test patterns for all the widths. In Table 4, the test time for W ¼ 1 is actually the upper bound on test time (U T ) for the core, and the last number of each column is the minimum test time (LT ) and corresponding width is W min of the core. First, we try to design a one branch TDMTAM with minimum width for this SoC. In the worst case the core with largest U T has the minimum share of the bus, so here if we assume that M ¼ 16 the upper bound on the

SoC test time would be 16  maxi¼1...10 ðU T i Þ ¼ 16 191; 874 ¼ 3; 069; 984. However, if we apply our optimization algorithm to find a better masking scheme, we save 200% in the test time. The best masking scheme (for M ¼ 16) is 1, 1, 1, 1, 3, 3, 3, 1, 1, 2 for core 1; . . . ; 10, respectively, giving test time of 1,023,328. As mentioned before, with the increase of M (the number of bits in the mask), the test time could decrease. For example, for M ¼ 32 the best masking scheme is 1, 1, 1, 3, 10, 8, 3, 1, 1, 5, and

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

the corresponding test time is 425,632 (i.e., 140% further improvement). The lower bound on the total test time for D695, as given by Eq. (6), is the maximum of all the core’s test time lower bound. From Table 4, core 6 has the largest lower bound (LT 6 ¼ 9869), so the lower bound on the total test time of D695 SoC is 9869. The required width Pto achieve this lower bound,3 as given by Eq. (7), is W ¼ 10 i¼1 W i ðT i o9869Þ ¼ 1 þ 1 þ 1 þ 4 þ 32 þ 20 þ 9 þ 3 þ 3 þ 3 ¼ 77. In this design each core i has its own private branch width W i ðT i oLOTT Þ and the mask of each core is M. Later we show that with better design, the width to attain the LOTT can be reduced. For example in the following design, Fig. 9, the total test time has reached its lower bound (9869). However, the total width is 65. That is, we save 12 lines in the TAM width.

157

20 Branch 1 width=20

Core 6 ||mask||=32

32 Branch 2 width=32

Core 5

Core 7

||mask||=21

||mask||=11

7 Branch 3 width=7

Core 4

Core 9

||mask||=19

||mask||=13

3

4. Case study 4.1. SoC platform description To evaluate the effectiveness and tradeoffs associated with the proposed TDM-TAM, a network processor unit (NPU) design was developed and used as a target test vehicle. Our NPU is an OSI Layer 3 device that forwards IPv4 packets [19]. The NPU’s major blocks include a preprocessing unit, a classifier, an embedded processor, several post-processing units, and various memory components. The blocks communicate with each other through an AMBATM -compliant high-speed bus [19]. Point-to-point connections are also used for control signals and interrupts. The NPU blocks were developed following reuse guidelines such as I/O buffering and sub-blocks partitioning. The embedded processor is a modified Motorola HC11TM micro-controller [19]. The AMBA-compliant bus was developed using the AMBA AHB specifications available from ARM Ltd. The bus uses a master-slave request-based architecture with multiple clock cycles per transfer. The pipelined bus design yields an efficient implementation that satisfies the high-bandwidth requirements of the NPU. With regard to the core-level test methodology, a fullscan test methodology is assumed for the preprocessing and the post-processing units. One single scan chain was created for each block, and scan vectors were generated by an ATPG tool and assumed to be provided to the blocks from an external source. To emulate a heterogeneous test methodology environment, the classifier and embedded processor blocks are assumed to be tested using a logic BIST methodology. The logic BIST uses a 32-bit linear feedback shift register (LFSR) to generate pseudo-random test vectors and uses a signature analyzer to compact the test result. The memory modules use memory BIST that 3 For all the cores the width required to achieve core test time p9869 has been highlighted in the Table 4.

Branch 4 width=3

Core 3

Core 10

||mask||=9

||mask||=23

3 Branch 5 width=3

Core 1

Core 2

Core 8

||mask||=1

||mask||=1

||mask||=30

Fig. 9. Benchmark D695: design for minimum test time, 5 branches, with mask 32 bits, total width ¼ 65, total test time ¼ 9869.

runs a Marching C algorithm. All the blocks and the associated test structures are encapsulated with P1500compliant wrappers [12]. As illustrated in Fig. 10, we compared three different TAM architectures using the NPU as a target design. The first TAM, referred to as Serial P1500, leverages the new P1500-compliant wrappers. The proposed P1500 standard architecture resembles the STD 1149.1 Test Access Port and Boundary Scan architecture. The most noticeable difference is the removal of the TAP controller and the addition of a parallel test port. By removing the TAP controller and providing more access ports, the serial input constraint of STD 1149.1 is removed and a wide variety of TAMs are supported. It is possible to serially thread P1500 wrappers to create a simple TAM for a SoC. The second TAM is a test NoC called NIMA [7]. The basis of NIMA is the establishment of indirect digital communication paths between cores through the use of a packet-switching connections. In our specific case, we use the on-chip network for test purposes. Hence, the test methodology when assuming NIMA consists of converting test vectors into packets that can find their way from a source to their respective destination cores. Similarly, test results can also be transferred from cores to a sink. The

ARTICLE IN PRESS 158

Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

Test NoC (NIMA)

Serial P1500

N

TDM (Bus-based)

Network

Fig. 10. TAM architectures considered in our comparison.

underlying assumption here is that all test-related communications destined/originated to/from the SoC need to be communicated using the on-chip network. Finally, the third TAM used in our comparison is the TDM-TAM. While all three TAMs considered in our study use P1500 wrappers, the NIMA TAM requires that test data be preprocessed into packets. Therefore, with NIMA, test data (either scan test vectors and/or BIST instructions) must initially be converted into test packets and subsequently sent to designated blocks. In turn, to translate the test packets back into information understandable by a P1500 wrapper, a NIMA interface module is required for each SoC block. A controller within the NIMA interface module generates the P1500 wrapper control signals. Since the P1500 specifications do not include the implementation of such a controller, we assumed the use of the 1149.1 TAP controller for this purpose. This controller is responsible for generating the required P1500 control signals from the NIMA test packets such that the P1500 wrapper perform the operations outlined in the standard. As mentioned above, two test methodologies were applied to our NPU design, namely those of full scan and BIST. The BIST approach possesses the advantage of lower test traffic compared to the case of full scan. Moreover, BIST eliminates the need for off-chip vector storage and management, thereby simplifying the required ATE. On the other hand, the BIST methodology used here is based on pseudo-random pattern generation. The latter typically suffers from a reduced fault coverage in comparison to full scan unless special measures are taken. For example, one way to mitigate this drawback is to boost the number of BIST test patterns in comparison to the number of full-scan test patterns. This, in turn, amounts to longer test times. Consequently, it is evident that various test methodologies will have different impacts on the choice and effectiveness of different TAM architectures. 4.2. Results We derived specific test time and area overhead models for the TDM-TAM when the latter is applied to the NPU design described above. For test time, we derived expressions that depend on a core’s test methodology, i.e., whether full scan or BIST. For BISTed cores, we assume that a given set of instructions are required to initiate and complete the BIST while the actual test (application of test

patterns and signature generation) can proceed independently from the specific TAM architecture. Hence, for a BISTed core, tD is the time required to send the BIST instructions while tI is the core test time. Based on simulations we performed on our NPU design, the times required to implement core BIST functions were as follows: Instruction

Time (cycles)

Reset Start BIST Load reference signature Capture signature

1 11 38 16

Hence, for BISTed cores: tD ¼ Reset þ StartBIST ¼ 12 cycles, tI ¼ tc ¼ CoreScanLength  CoreVectorCount. Cores tested using a full-scan test methodology must use the TAM throughout the entire test session. Hence, tI ¼ 0 for such cores. Based on our simulation experiments, tD includes 3 times loading instruction, one loading signature and one capture signature plus the time to load test pattern which is VectorCount  ðScanLength þ 5Þ. So for the cores using full-scan: tI ¼ 0, tD ¼ 2 þ 3  LoadInstruction þ LoadSignature þ VectorCount  ðScanLength þ 5Þ þ CaptureSignature. The above formulae for tI and tD are not general and will differ for different designs. Nevertheless, our simulation experiments with our NPU show that these formulae are relatively accurate for test time predictions for our NPU design. The test time prediction errors appear to be bounded by M cycles in our specific case. For the area overhead associated with TDM-TAM when applied to our NPU, we used the area model described in Section 2.3. Results for the frame lengths of 16 and 32 are reported in Table 5, assuming the NPU design implemented in a 18 mm CMOS technology. The area overhead for different TAMs are compared in Table 5. The serial TAM has the minimum area overhead, and that for TDM is very close, while the test time for

ARTICLE IN PRESS Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160 6

Table 5 Overhead area comparison

3

Overhead area mm

TDM (with 16 bits mask) TDM (with 32 bits mask) NIMA Serial

2

18,882 30,690 1,343,236 5041

1 2 3 4 5 6

scenario 6

2.5 scenario 5

% 0.75 1.22 53.3 0.2

Table 6 The scenarios for assessing TAM performance

Scenario Scenario Scenario Scenario Scenario Scenario

x 10

Serial TAM NIMA TDM

Total test bits

Core that use full scan

420,014 522,460 692,364 972,072 1,789,122 2,769,456

Pre and post Pre, post and HC11 Pre, post, HC11 and classifier Pre, post  4, HC11, and classifier Pre, post  4, HC11  4, and classifier Pre  4, post  4, HC11  4, and classifier  4

TDM is much shorter than that of the serial TAM. The area overhead for the NIMA is very large as compared to the other two TAMs. For the test time, the three TAMs were compared for six different scenarios with each new scenario corresponding to an increased number of cores tested using a full-scan test methodology, thereby corresponding to increased test data volumes for each new scenario in Table 6. Fig. 11 illustrates the total test time (measured in clock cycles) required for testing the NPU assuming full-scan for NIMA and TDM of different widths (from 2 to 5 bits) and the serial TAM. The horizontal axis indicates the total test data (in bits) transferred. The BIST execution time is approximately 530,000 cycles for all six scenarios, and it is chosen as the threshold for total test time. From Fig. 11, it is obvious that the serial TAM has the worst test time and TDM with width 5 has the minimum test time. The test time of NIMA is very close to that of the TDM, but the area overhead for NIMA is much larger than for TDM-TAM.

5. Conclusion We proposed a new bus-based TAM which is scalable and outperforms two other TAMs in terms of test time and overhead area. The proposed TAM uses the concept of time-division multiplexing (TDM) to effectively reduce TAM area requirements while still achieving good test time performance. This is possible by making optimal assignments of cores to buses and optimal assignment of time slots in the TDM scheme. We illustrated the use of a genetic algorithm-based methodology to achieve such optimal assignments. We presented test time and area requirement models for our TDM-TAM.

Testing time (cycles)

TAM type

159

2 width=2 scenario 4

1.5 width=3

scenario 3

1

scenario 2

scenario 1

0.5

0

width=4 width=5

0

0.5

1

1.5

2

2.5

Total bits transferred

3 6

x 10

Fig. 11. Test time for serial P1500, NIMA, and TDM-TAM.

TDM-TAM could be a solution for testing multifrequency SoCs. By using TDM-TAM for testing TDMTAM not only we can test each core at its frequency, but also we can test the SoCs not using high-frequency and complicated (expensive) ATEs. TDM-TAM not only generally offers excellent time and area performance, but it also enables optimal reconfiguration of test resources post-fabrication using the ability to program the mask. This ability is particularly attractive for addressing test requirements that cannot be anticipated pre-fabrication, for example in cases requiring increased diagnostics due to one or more faulty cores on an SoC. We implemented the TDM-TAM on a network processor unit design, and compared area and test time of TDM-TAM to other proposed TAMs. We illustrated how TDM-TAM offers an attractive alternative to such TAMs, because of its smaller area requirements and comparatively smaller test time.

References [1] Y. Zorian, E. Marinissen, S. Dey, Testing embedded-core based system chips, in: Proceedings of the International Test Conference, 1998, pp. 130–143. [2] V. Immaneni, S. Raman, Direct access test scheme-design of block and core cells for embedded asics, in: Proceedings of the International Test Conference, 1990, pp. 488–492. [3] I. Ghosh, N. Jha, S. Dey, A low overhead design for testability and test generation technique for core-based systems, in: Proceedings of the International Test Conference, 1999, pp. 50–59. [4] N. Touba, B. Pouya, Testing embedded cores using partial isolation rings, in: Proceedings of the International Test Conference, 1997, pp. 10–16. [5] L. Whetsel, An IEEE 1149.1 based test access architecture for IC with embedded cores, in: Proceedings of the International Test Conference, 1997, pp. 69–78. [6] L. Whetsel, Core test connectivity communication and control, in: Proceedings of the International Test Conference, 1998, pp. 303–312.

ARTICLE IN PRESS 160

Z.S. Ebadi et al. / INTEGRATION, the VLSI journal 40 (2007) 149–160

[7] M. Nahvi, A. Ivanov, A packet switching communication-based test access mechanism for system chips, in: Proceedings of the European Test Workshop, 2001, pp. 81–86. [8] E. Marinissen, A structures and scalable mechanism for test access to embedded reusable cores, in: Proceedings of the International Test Conference, 1998, pp. 284–293. [9] P. Varma, S. Bhatia, A structured test re-use methodology for corebased system chips, in: Proceedings of the International Test Conference, 1998, pp. 294–302. [10] L. Whetsel, Addressable test ports: an approach to testing embedded cores, in: Proceedings of the International Test Conference, 1999, pp. 1055–1064. [11] K.S. Shanmugam, Digital and Analog Communication Systems, Wiley, New York, 1979 (Chapter 10). [12] IEEE P1500 web site, http://grouper.ieee.org/groups/p1500/ (2001). [13] S. Goel, E. Marinissen, TAM architecture and their implication on test application time, in: Proceedings of the International Workshop on Test Embedded Core-based Systems, 2001, pp. 3.3–1–10.

[14] K. Chakrabarty, Design of a system-on-chip test access architecture using integer linear programming, in: Proceedings of the VLSI Test Symposium, 2000, pp. 127–134. [15] Z. Ebadi, A. Ivanov, Design of an optimal test access architecture using genetic algorithm, in: Proceedings of the Asian Test Symposium, 2001, pp. 205–210. [16] ITC’02 SoC Test Benchmarks, http://www.extra.research.philips. com/itc02socbenchm/ (2002). [17] E. Marinissen, S. Goel, M. Lousberg, Wrapper design for embedded core test, in: Proceedings of the International Test Conference, 2000, pp. 911–920. [18] V. Iyengar, K. Chakrabarty, E.J. Marinissen, Test wrapper and test access mechanism co-optimization for system-on-a-chip, J. Electron. Testing: Theory Appl. (JETTA) 18 (2001) 213–230. [19] L. Hong, M. Nahvi, R. Fung, A. Ivanov, R. Saleh, Novel test methodologies for soc/ip design implementation and comparison, in: Proceedings of the IEEE International Workshop on System-on-Chip for Real-Time Applications, 2002, pp. 20–30.