Conditional time resolved photoemission for debugging ICs with intermittent faults




Microelectronics Reliability 48 (2008) 1289–1294


Microelectronics Reliability journal homepage: www.elsevier.com/locate/microrel

Frank Zachariasse *, Jan van Hassel

NXP Semiconductors, QAS Failure Analysis, Gerstweg 2, 6534AE Nijmegen, The Netherlands

Article info

Article history: Received 27 June 2008

Abstract

Intermittently failing ICs are difficult to debug with techniques that measure internal signals, such as time resolved photoemission (TRE), because the measurements will contain a mixture of passing and failing behaviour. In this paper, we show that by swapping two BNC cables on the outside of an Emiscope II TRE instrument, it becomes possible to measure the passing and the failing behaviour of an intermittently failing IC separately. We illustrate the technique in two case studies.

© 2008 Published by Elsevier Ltd.

1. Introduction

This paper introduces a technique to measure internal signals in an integrated circuit that exhibits intermittent failure. A characteristic of such an IC is that, when repeatedly tested, the IC sometimes passes and sometimes fails, and the outcome of each test cannot be predicted beforehand. If the IC cannot be brought to a state where it is failing consistently, then measurement techniques such as time resolved photoemission (TRE [1]) are difficult to apply to the investigation. This is because such measurements are accumulated over many repeated tests, with the assumption that the IC behaves identically during every test. Applying them to an intermittently failing IC results in a mixture of ‘pass’ and ‘fail’ behaviour in each measurement, which may hide the failing behaviour of the IC.

This paper presents a solution to this problem, achieved by filtering the collected signal such that only failing test loops are incorporated into the measurement result. Alternatively, one may record only the passing tests instead, under the same test conditions. Comparing these two separate behaviours can yield insight into the problem that is not otherwise obtainable. We name this technique “conditional time resolved photoemission” (CTRE). Although the principle of such a measurement was already foreseen in early work on PICA [2], reports of a practical implementation have not, as far as we are aware, been published until now.

We start by explaining how to set up a commercial time resolved emission instrument (Emiscope II [3]) to perform CTRE. Practical application of the technique is then demonstrated in two case studies. The first case concerns intermittent failure of an I2C interface, where the failure occurred in at most 3% of tests. The second case study shows how CTRE helped to identify which machine code instruction was the first to cause the failure, when a digital signal processor failed to reach its design speed.

2. Time resolved photoemission

To perform a TRE measurement, an IC is programmed to perform a test repeatedly, in a ‘test loop’. The Emiscope instrument observes a certain logic gate on the IC. Photons emitted when that logic gate switches are collected, and their time of arrival is recorded in a histogram. The start of the histogram is marked by a ‘trigger’ pulse, given to the instrument at the start of every loop. After many repeats of the test loop, the histogram shows peaks of photon emission, each marking the time of a particular switching event after the trigger.
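The accumulation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the instrument's software: the function name, the nanosecond units, the detection probability `p_photon` and the Gaussian timing jitter are all hypothetical choices made for illustration.

```python
import random


def accumulate_tre_histogram(switch_times_ns, loop_length_ns, n_loops,
                             bin_width_ns=1.0, p_photon=0.01,
                             jitter_ns=0.2, seed=0):
    """Accumulate photon arrival times (relative to the trigger) over many loops.

    All parameters are illustrative assumptions, not Emiscope II settings.
    """
    rng = random.Random(seed)
    n_bins = int(loop_length_ns / bin_width_ns)
    histogram = [0] * n_bins
    for _ in range(n_loops):
        # Each loop starts with a trigger; the gate switches at fixed delays.
        for t_switch in switch_times_ns:
            # Photon emission is rare: most switching events yield no count.
            if rng.random() < p_photon:
                t_arrival = t_switch + rng.gauss(0.0, jitter_ns)
                bin_index = int(t_arrival / bin_width_ns)
                if 0 <= bin_index < n_bins:
                    histogram[bin_index] += 1
    return histogram


hist = accumulate_tre_histogram([20.5, 60.5], loop_length_ns=100.0,
                                n_loops=200_000)
# The two most populated bins mark the two switching events.
peaks = sorted(sorted(range(len(hist)), key=hist.__getitem__)[-2:])
print(peaks)
```

The sketch relies on the same assumption as a real TRE measurement: the gate switches at the same delay after the trigger in every loop, so that rare single-photon events add up to a peak.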

* Corresponding author. Tel.: +31 243532964; fax: +31 243533659. E-mail address: [email protected] (F. Zachariasse).
0026-2714/$ - see front matter © 2008 Published by Elsevier Ltd. doi:10.1016/j.microrel.2008.06.034

3. Conditional time resolved photoemission

3.1. Photon counting schemes

Timing measurement in the Emiscope II is done by the ‘picosecond timing analyser’ (pTA) [4]. The pTA is a high performance timer-counter, capable of measuring time intervals to picosecond resolution. Effectively, the pTA starts a counter when a ‘start’ signal is received, and stops that counter on receiving a ‘stop’ signal. The counted value is the photon arrival time. This measurement is used to update a histogram of arrival times.

Conventionally, the start signal is the trigger signal, which also switches on the photon detector after a programmed delay (see Fig. 1). The detection of a photon produces the stop signal. These signals enter the pTA through two separate BNC cables. The measured time difference between them is the photon arrival time after the trigger.

In the Emiscope II instrument, the photon detector may be set to be on only during a given portion of the test loop. Both the time offset from the trigger to the start of the detector ‘window’ and the length of the window may be chosen. Similarly, the pTA is receptive to a stop event only during a defined ‘time window’, as shown in Fig. 1. For the scheme of Fig. 1, the photon detector and pTA must be enabled during the same portion of the test loop.
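The role of the two windows in the conventional scheme can be made concrete with a short sketch. The function and parameter names below are hypothetical and do not correspond to any Emiscope II or pTA setting names; the sketch assumes that all times are measured from the trigger, which is the start event.

```python
from typing import Optional


def conventional_tre_interval(photon_delay_ns: float,
                              det_offset_ns: float, det_length_ns: float,
                              pta_offset_ns: float, pta_length_ns: float
                              ) -> Optional[float]:
    """Return the start-to-stop interval, or None if no count is recorded.

    Illustrative model only: the trigger is the start event, the photon is
    the stop event, and both windows are defined relative to the trigger.
    """
    detector_on = (det_offset_ns <= photon_delay_ns
                   < det_offset_ns + det_length_ns)
    pta_receptive = (pta_offset_ns <= photon_delay_ns
                     < pta_offset_ns + pta_length_ns)
    if detector_on and pta_receptive:
        # The counted value is the photon arrival time after the trigger.
        return photon_delay_ns
    return None


# Matching windows: the photon at 35 ns is recorded.
print(conventional_tre_interval(35.0, 10.0, 50.0, 10.0, 50.0))
# pTA window opens too late: the same photon is lost.
print(conventional_tre_interval(35.0, 10.0, 50.0, 40.0, 50.0))
```

In this model a photon is only counted when both windows cover its arrival time, which is why the detector and the pTA must be enabled during the same portion of the loop.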


Fig. 1. Measurement scheme for conventional TRE. The triangle indicates a photon received by the photon detector. [Diagram labels: trigger = start event; detector offset and ‘detector on’ window; pTA offset and pTA window; photon = stop event; histogram peak position = time after trigger; horizontal axis: time.]

In practice, there will be many test loops during which no photons are detected, because photon emission is a comparatively rare event. During these loops, the pTA sees no ‘stop’ event during the active windows, and so does not increment any bin in its timing histogram.

For CTRE, an alternative collection scheme is used. This is achieved by swapping the start and stop inputs on the pTA, simply by exchanging two BNC connectors accessible at the front of the instrument. This scheme is shown in Fig. 2. By swapping start and stop inputs, the pTA counter is now started only when a photon arrives. The next trigger after the photon stops the counter.

Fig. 2. Measurement scheme for CTRE, obtained by swapping start and stop inputs. [Diagram labels: previous trigger; trigger = stop event; detector offset and ‘detector on’ window; pTA offset and pTA window; photon = start event; histogram peak = time BEFORE trigger; horizontal axis: time.]

In this alternative measurement scheme, the time being measured is actually the photon arrival time before the trigger. As long as the test loops are periodic, with a constant time between triggers, this is an equally valid measurement. However, since earlier photons arrive longer before the second trigger, swapping the start and stop cables on the Emiscope II has the effect of reversing the time axis on collected traces. Once this is known, the interpretation of the results is in no way impaired.

In addition, Fig. 2 shows that for this scheme the windows for the photon detector and pTA are no longer open at the same time. The Emiscope still turns on the photon detector a given time after the first trigger pulse, no matter how the pTA is connected. The pTA window must, however, be open during the second trigger pulse, to accept the stop event. Therefore, we must ensure that window lengths and offsets, for both detector and pTA, are set appropriately.

3.2. Filtering the collected photons

The above two collection schemes could both be used to collect conventional time resolved photoemission measurements. However, we now show that only the swapped scheme makes it possible to perform CTRE.

The pTA is equipped with ‘inhibit’ inputs that make it possible to reject some counts. If a TTL logic ‘1’ signal is present on the “stop inhibit” input when a stop signal arrives, then the stop signal is ignored by the pTA, and thus that photon is never included in the histogram. Fig. 3 shows where these inputs are located on the front panel of the pTA.

Fig. 3. pTA front panel, showing the location of start, stop, and stop inhibit inputs. [Photograph labels: START input; STOP input; “STOP inhibit” input.]

In order to filter the signal, our device under test (DUT) needs to supply a TTL output that is ‘0’ for fail, and ‘1’ for pass. This signal is then fed into the stop inhibit input of the pTA. Referring to Fig. 1, we see that in order to filter the arriving photon according to pass or fail, the pass/fail result would have to be already available at the time when the photon arrives. Clearly this is impossible: the switching event that gave us this photon may even happen before the pass/fail test for that loop is performed.

For the CTRE scheme in Fig. 2, however, the pass/fail signal does not need to be available when the photon arrives, but only when the next trigger arrives. This makes it possible to build in any amount of time to calculate the test result, if necessary by lengthening the test loop. The only requirement is that the duration of the test loop must be constant, independent of pass or fail. Thus, by the simple action of swapping the cables carrying the start and stop signals on the front of the instrument, and supplying a pass/fail signal to the stop inhibit input, CTRE becomes possible.

Fig. 4 shows how such a scheme could work if a scan test pattern were applied to the IC. A scan test consists of three phases: scanning data serially into the IC, performing a ‘normal mode’ clock, when outputs of combinatorial logic are sampled, and scanning the sampled data serially out of the IC. Typically, the result of the test is evaluated inside a digital IC tester, and this tester is then programmed to output the test result, pass or fail, onto a signal wire which the user can connect to the pTA for CTRE filtering. Since the actual testing of the IC happens at the normal mode clock, yet the result becomes available only after the scan chain has been scanned out and evaluated in the tester, the trigger signal in this case has been arranged to occur at the point where the test result is already available.

Alternatively, depending on the problem being investigated, the IC may perform an internal test, for example by means of a program running on a microprocessor inside the IC. In such a case, it is possible that the test itself is far shorter than the minimum test loop length of 2 µs that the Emiscope II can accommodate. Thus, multiple tests can be performed during each trigger loop, as illustrated in Fig. 5. In Fig. 5, the top trace represents the trigger signal. The bar at the bottom of the figure represents a program being executed inside the IC. In this program, multiple tests are performed between trigger signals, here marked Test 1, 2 and 3. The centre trace represents the pass–fail signal output from the IC. In this case, there is a delay between the test being performed and the pass–fail output becoming available. Hence, when the second trigger arrives, it is the result of test 2 that is used for filtering, although all 3 tests can be measured with TRE. The filtered TRE data therefore applies only to the portion of the histogram taken while the IC was actually performing test 2. Such a scheme requires careful attention to the timing of the pass–fail signal and the portion of the histogram to which the filtering applies.
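The combined effect of the swapped inputs and the stop inhibit can be sketched in a short simulation. This is a minimal model under stated assumptions, not a description of the pTA internals: the names, the nanosecond units, the single switching event per loop and the photon probability are hypothetical. Selecting the passing loops by inverting the pass/fail signal (`keep="pass"`) is likewise an assumption made here for illustration.

```python
import random


def simulate_ctre(loop_length_ns, n_loops, fail_rate,
                  pass_switch_ns, fail_switch_ns, keep="fail",
                  p_photon=0.05, bin_width_ns=1.0, seed=1):
    """Histogram of photon-to-next-trigger intervals, filtered by pass/fail."""
    rng = random.Random(seed)
    n_bins = int(loop_length_ns / bin_width_ns)
    histogram = [0] * n_bins
    for _ in range(n_loops):
        failed = rng.random() < fail_rate
        if rng.random() >= p_photon:
            continue  # no photon detected, so the counter is never started
        # Assumed behaviour: the observed gate switches at a different
        # moment in a failing loop than in a passing loop.
        t_switch = fail_switch_ns if failed else pass_switch_ns
        # Swapped inputs: photon = start event, next trigger = stop event.
        time_before_trigger = loop_length_ns - t_switch
        # DUT output: '0' for fail, '1' for pass, valid at the next trigger.
        pass_fail = 0 if failed else 1
        # A '1' on stop inhibit makes the pTA ignore the stop event.
        inhibit = pass_fail if keep == "fail" else 1 - pass_fail
        if inhibit:
            continue
        bin_index = int(time_before_trigger / bin_width_ns)
        if 0 <= bin_index < n_bins:
            histogram[bin_index] += 1
    return histogram


def to_time_after_trigger(histogram):
    """Undo the reversed time axis (valid for a constant loop length)."""
    return histogram[::-1]


fail_hist = to_time_after_trigger(
    simulate_ctre(100.0, 100_000, 0.03, 30.5, 45.5, keep="fail"))
pass_hist = to_time_after_trigger(
    simulate_ctre(100.0, 100_000, 0.03, 30.5, 45.5, keep="pass"))
print(fail_hist.index(max(fail_hist)), pass_hist.index(max(pass_hist)))
```

Because the decision to keep or reject a count is taken at the stop event, the pass/fail value only has to be valid when the next trigger arrives. The reversal in `to_time_after_trigger` is only meaningful because every loop has the same length.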

Fig. 4. Typical test scheme for performing CTRE when the IC is scan tested. Only one test is done in each trigger loop. The trigger is arranged to arrive when the pass/fail test result of the previous test is available from the digital tester. [Diagram labels: trigger; pass/fail; test phases: scan in, NM (normal mode), scan out, repeated; photon.]

Fig. 5. Example of CTRE tests with more than one test performed between trigger moments. [Diagram labels: trigger; Test 1, 2, 3, ...; Pass/Fail 1, Pass/Fail 2, Pass/Fail 3.]

4. Case study 1: intermittently failing I2C interface

CTRE was used to debug a design fault in an I2C interface on a complex IC. An I2C interface is a low speed serial interface, making use of a bi-directional data signal SDA, controlled by a clock signal SCL. The interface is governed by a state machine that controls data transfers across the bus. Since multiple ICs can be connected to the same I2C bus, and some of these can act both as master and slave at different times, situations of data collision and bus arbitration need to be handled correctly in the design of an I2C interface controller.

It was found that the interface operated correctly, but that, with the I2C bus loaded with certain capacitance values, approximately 2–3% of I2C transactions would be terminated early by the IC, indicating an error in the controlling state machine. This failure was intermittent, and it was not possible to increase the failure rate to 100%. As part of their investigation, the designers were keen to know how the state machine was behaving during the failing tests, and during the passing tests. This could have been achieved by other means, for example through probing internal signals with needles, but CTRE was found to be a faster and more convenient method. The advantages of TRE, such as its non-invasive nature and its ability to probe multiple nodes in the circuit without needing to make additional probe pads, also apply to CTRE.

An application program was written to send data repeatedly across the I2C bus, testing the correct performance of the interface after every transfer. Care was taken to ensure that the test loop was of equal duration both for passing transfers and failing transfers. Trigger and pass/fail outputs were generated by the same IC, through general purpose digital IO pins.

Certain signals in the state machine were measured for both passing and failing tests (Fig. 6). Since the rate of failure was only 3%, the acquisition time for the fail signal was in this case 30 times longer than for the pass signal. These measurements, taken at various points inside the state machine, indicated that the state machine was running through a particular series of disallowed states during the failing cycles. Data taken at the same points without the CTRE set-up masked this incorrect behaviour, since the 3% of failing loops were lost in the noise.

The root cause of the problem was found in the fact that a ‘raw’, unsynchronised version of the SCL signal was being used to enable two separate flip-flops, as shown in Fig. 7. The failure would occur only under certain special circumstances, shown in Fig. 8: if the timing happened to be such that the SCL rising edge coincided with the 40 MHz clock, then it was possible for Q1 to be correctly enabled, while Q2 was not enabled until after the clock edge. This would lead to a wrong branch being taken in the state machine, since the design assumed both flip-flops to be enabled together.

The understanding gained from the CTRE measurements enabled the designers to simulate the circuit under both passing and failing conditions, and to compare CTRE results and simulation results, as shown in Fig. 9. This gave full confidence that the problem had been understood, and that the design fix (to use a synchronised version of the SCL signal for both E1 and E2) would work.
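The factor of about 30 between the fail and pass acquisition times in this case study follows directly from the failure rate. The short calculation below is a sketch that assumes the same photon yield per loop for passing and failing loops and the same target number of counts in both histograms; the function name is hypothetical.

```python
def acquisition_time_ratio(fail_rate: float) -> float:
    """Fail-only versus pass-only acquisition time for equal photon counts.

    Assumes equal photon yield per loop in passing and failing loops, so
    that the time needed scales inversely with the fraction of loops kept.
    """
    if not 0.0 < fail_rate < 1.0:
        raise ValueError("fail_rate must lie strictly between 0 and 1")
    return (1.0 - fail_rate) / fail_rate


ratio = acquisition_time_ratio(0.03)
print(round(ratio, 1))  # roughly 32, consistent with the factor of about 30
```

The same expression shows why very rare failures become impractical to measure: at a failure rate of 0.1% the fail acquisition would take about 1000 times longer than the pass acquisition.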

Fig. 6. TRE signals measured inside the I2C state machine for passing and failing tests. [Plot: PASS and FAIL traces against time.]

Fig. 7. Simplified block diagram of the I2C state machine. [Diagram labels: SCL bondpad, 400 kHz; flip-flops D1/Q1 with enable E1 and D2/Q2 with enable E2; CLK 40 MHz; I2C state machine; SDA bondpad.]

Fig. 8. Situation at the flip-flops shown in Fig. 7, during the failure. [Timing diagram labels: SCL; E1; E2; CLK with active edge; Q1; Q2.]

5. Case study 2: identifying the first failing instruction

This case concerns a digital signal processing (DSP) block on an IC, which could not reach the design speed, but started to yield incorrect results to certain DSP calculations at a significantly lower speed than expected. It was found that a particular short section of 57 DSP machine code instructions would yield the incorrect result. Therefore, one of the tasks was to discover which of the 57 instructions was the first to lead to a failure. Since the DSP could only run code from read-only memory, and there was no facility built into the IC to stop the program at a particular processor cycle, conventional debugging methods were not easy to use on this problem.

In this particular case, it was possible to obtain a 100% failure rate from the IC, so the failure was not inherently ‘intermittent’. However, by deliberately choosing to operate the IC on the border of pass and fail, so that its behaviour became intermittent, we were able to use CTRE to measure internal signals for failing and passing tests. For all these results, the internal operating conditions (voltage, temperature, etc.) of the chip are identical for both passing and failing cases. Hence the propagation times of all internal signals are also the same for pass and fail. This makes it possible to subtract the passing and failing photon count histograms from each other. For all signals that do not relate to the failure, the histogram peaks are the same (to within measurement noise), and thus subtracting them yields zero signal. Signals that do differ between pass and fail, on the other hand, show up when subtracted from each other.

The principle is illustrated in Fig. 10: if we start our test loop with ‘fresh’ data and carry out a set of calculations on that data, then, by definition, the result of every instruction before the first failing instruction will be the same for both pass and fail (Fig. 11). The results of all instructions after the first failing one are ‘tainted’ by the first failing data. Therefore, we can subtract pass and fail measurements of results, and the first ‘peak’ in the difference indicates the first failing instruction.

The CTRE data showed that the first failing instruction was also the first one to use the arithmetic shift unit (ASU) of the digital signal processor. In this debug case, LADA [5,6] was additionally used to show a critical path in that shift unit (Fig. 12), and other TRE measurements along this path indicated how to solve the criticality of that net. The contribution of CTRE to this case was to show that the failing instruction was indeed the one that depended on the signal nets indicated by the other methods.
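The subtraction of pass and fail histograms described in this case study can be sketched as follows. This is an illustrative procedure, not the analysis actually used for the measurements: it assumes both histograms have already been converted to time after the trigger, it scales the fail histogram to the total count of the pass histogram, and it uses a Poisson noise estimate with a hypothetical threshold `n_sigma`.

```python
def first_difference_bin(pass_hist, fail_hist, n_sigma=5.0):
    """Index of the earliest bin where pass and fail differ beyond noise.

    Returns None if no bin differs significantly. The normalisation and
    the noise threshold are assumptions made for this sketch.
    """
    total_pass = sum(pass_hist)
    total_fail = sum(fail_hist)
    if total_pass == 0 or total_fail == 0:
        raise ValueError("both histograms must contain counts")
    # Scale the fail histogram so both represent the same total counts.
    scale = total_pass / total_fail
    for index, (p, f) in enumerate(zip(pass_hist, fail_hist)):
        difference = p - scale * f
        # Poisson statistics: var(p) = p and var(scale * f) = scale^2 * f.
        sigma = (p + scale * scale * f) ** 0.5
        if sigma > 0 and abs(difference) > n_sigma * sigma:
            return index
    return None


# Hypothetical data: peaks at bins 10 and 20 agree, the third peak moves.
pass_hist = [0] * 50
fail_hist = [0] * 50
for b in (10, 20, 30):
    pass_hist[b] = 1000
for b in (10, 20, 35):
    fail_hist[b] = 100
print(first_difference_bin(pass_hist, fail_hist))
```

With a known cycle time, the returned bin can be converted to a processor cycle and hence to an instruction. Bins before the first difference correspond to instructions whose results are identical for pass and fail.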

Fig. 9. CTRE results and simulation results for pass and fail. The white oval shows the disallowed path being taken through the state machine in the failing cases.

Fig. 10. Repeating a calculation with ‘clean’ data at the start of each loop. The first cycle showing ‘bad’ data in a result register is the first instruction causing the failure. [Diagram content, time running left to right:]

Instruction:  1  2  3  4  5  1  2  3  4  5  1  ...
Fail result:  P  P  P  F  F  P  P  P  F  F  P
Pass result:  P  P  P  P  P  P  P  P  P  P  P

Instruction #4 is the first failing.


Fig. 11. The clock signal of the DSP did not differ between pass and fail, whilst a given bit in the result accumulator did differ. The time of the first failing result indicated the first failing instruction. [Panel annotations: Clock signal (upper pass/fail pair): shows no difference between pass and fail; the fault is not related to the clock in this case. Data signal (lower pass/fail pair): shows incorrect data appearing in a result register, which indicates that the first failing instruction occurred just before that point.]

Fig. 12. The left hand frame shows LADA results, pointing to a particular net in the “arithmetic shift unit” (ASU) of the digital signal processor (block diagram on the right). This result confirmed the findings of CTRE, which showed that the first failing instruction was one that made use of the ASU.

6. Conclusions

In this paper, we have shown that it is possible to perform conditional time resolved photoemission measurements on an Emiscope II TRE instrument. The straightforward action of swapping start and stop signal wires on the outside of the instrument, together with appropriate settings and an appropriate IC test, is sufficient to make this possible. The practical application of CTRE to IC debug was illustrated in two case studies. In the first case, the low rate of failure necessitated CTRE in order to make the failing behaviour inside the IC traceable. In the second case, the IC was deliberately placed into an intermittently failing state, to make it possible to identify the first failing instruction in a piece of program code running on the IC. Together, the two cases demonstrate that CTRE is a practical and useful new technique in silicon debug and failure analysis.

References

[1] Vallet D. Picosecond imaging circuit analysis – PICA. In: Microelectronics desk reference. 5th ed. ASM International; 2005. p. 369.
[2] Evans RJ, et al. US Patent, 6172512, 2001.
[3] Emiscope II instrument, supplied by DCG Systems Inc., Fremont, CA, USA.
[4] Ortec 9308 Picosecond timing analyser, www.ortec-online.com.
[5] Rowlette JA, Eiles TM. Critical timing analysis in microprocessors using near-IR laser assisted device alteration (LADA). In: Proc. ITC; 2003.
[6] Mels A, Zachariasse F. Reduction of acquisition time for RIL, SDL and LADA. In: Proc. ISTFA; 2007. p. 6–14.