Automotive IC reliability: Elements of the battle towards zero defects


Microelectronics Reliability 48 (2008) 1459–1463

Contents lists available at ScienceDirect

Microelectronics Reliability journal homepage: www.elsevier.com/locate/microrel

Invited Paper

Fred G. Kuper *

NXP Semiconductors, Nijmegen, The Netherlands and MESA, Institute for Nanotechnology, University of Twente, The Netherlands

Article info

Article history: Received 29 June 2008

Abstract

The battle towards zero defects consists of fast response to PPM signals, prevention of incidents and continuous improvement. In this paper, elements of all three branches are treated. A PPM analysis tool called the quality crawl chart is introduced that enables prediction of customer complaint levels based on an early set of warranty call rate data. The fact that the automotive industry is very cautious with process and product changes can be better understood from a practical example of a small change with (in the eyes of automotive) big consequences. Finally, it is shown that continuous PPM reduction activities also have an effect on the number of EOS/ESD customer returns, and that this category of fails forms a shared responsibility for both supplier and customer. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Most of the activities in reliability are dedicated to wear-out. This is understandable, because if wear-out sets in during the normal lifetime of a product, a company could face bankruptcy. However, the early part of the bathtub curve is also very important: a high probability of early fails directly influences a consumer’s appreciation of a product. Both are very true for the automotive industry. The drive towards zero defects is extremely strong in the automotive industry. In cars, passion, technology and finance meet each other. The drive for quality is intensified further because failing components can directly lead to life-threatening situations. The financial stakes are extremely high. As a consequence, the financial impact of claims and call-back actions, at an estimated cost of 1000 euros per car, can be disastrous for any company active in the automotive industry, especially when the unreliable part costs much less than one dollar to produce. Another aspect that plays a role is that the amount of electronics is continuously increasing. The effect is that the improvement in reliability of parts is partly counteracted by the increase of electronic content in cars. Electronic parts are used for ten years or more in car platforms. To meet the yearly increasing quality requirements for integrated circuits, one cannot wait for next generations of products: the improvement needs to be established with the present portfolio of products. To improve the reliability of integrated circuits one needs to work on two tracks: prevention of incidents and continuous production improvement.

* Tel.: +31 243538171. E-mail addresses: [email protected], [email protected]
0026-2714/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.microrel.2008.06.026

Continuous improvement of ICs that exhibit return rates in the single-digit PPM range and below is hard to monitor, and it is therefore difficult to get feedback on the effectiveness of improvement actions, let alone on the impact of changes. In this paper a method is shown that can be used to effectively interpret small PPM variations of production quality based on early data on warranty call rates (WCRs). An example of general PPM reduction is also given for the field of EOS/ESD, which typically lies in the gray area between producer-caused and customer-caused fails. Incident prevention is usually accomplished by process control and very tight maverick lot handling procedures, but even more by being extremely careful with all changes. Especially the latter appears to be very characteristic of the automotive industry. One needs to consider that in the automotive industry an IC that causes a few PPM of cars to stop is already considered an incident. In the last section of this paper a case is shown that illustrates the background of the automotive fear of change. The results shown in this paper were obtained on commercial high volume transceiver ICs. These in-vehicle-networking ICs link modules of a car (engine control units, airbags, etc.) to LIN (local interconnect network) or CAN (controller area network) buses. The ICs are relatively small (a few mm²), fabricated in SOI technologies and packaged in standard small outline packages.

2. Assessing and predicting customer return trends

The time after which failed products are returned to the manufacturer is close to 3 years in the automotive world, as illustrated in Fig. 1, where a cumulative distribution is given for the time between production date (the assembly date) and customer complaint for a typical product. These data can also be referred to as warranty call rate data, although technically warranty only applies to complaints from the field. In Fig. 1 it is also seen that line fall-off fails (in the assembly line of the module maker) and 0-km fails (during assembly and test of a new vehicle) account for 40% of the customer returns of this product. These returns occur within a year after assembly. The remaining 60% are fails from the field. Field fails start remarkably early (within weeks after assembly) and continue for up to 3 years. The curve that we see here is in fact the first part of the bathtub curve. It is somewhat affected by storage times (which accounts for the fact that some parts are only assembled in modules after close to a year), but the shape of a constantly decreasing failure rate is clearly visible.

Fig. 1. Time between the last process step (assembly) and the customer complaint for a high volume automotive product. The total is given, plus the three locations of origin of the complaint. (Axes: cumulative % of devices versus weeks, 0 to 156; curves: line fall-off, 0 km, field fails, all combined.)

The long time interval between first return and last return means that whenever an analysis of the returns is made, every month or quarter, the data stem from a very long production period. This makes analysis very difficult. The data can be analysed much more easily when all data are separated per production period and plotted cumulatively, as shown in Fig. 2 for two production years and a time interval of a quarter. One observes that the shape of all these curves is similar. The main difference is the saturation level to which the curves tend. In this way the production quality can easily be evaluated. We call these quality crawl charts. As an example, a quality crawl chart is shown of a typical high volume, mature product. From this quality crawl chart it can easily be seen that 2006 was a year without incidents and that 2005 had one bad quarter, during which the PPM levels of the worst production quarters of 2004 were reached. We see here a combined effect of incident prevention and continuous improvement.

2.1. Modelling of quality crawl charts
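As an illustration of how a quality crawl chart could be assembled from raw return records, the following Python sketch groups returns per production quarter and accumulates them into PPM as a function of the number of quarters after production. The function names, the record layout and the example numbers are all hypothetical and are not taken from the paper. The helper `extrapolate_saturation` rests on the observation that all curves share a similar shape and differ mainly in saturation level; the simple scaling it uses is an assumption of this sketch, not necessarily the prediction method applied by the author.

```python
from collections import defaultdict


def quarter_index(year, quarter):
    """Map (year, quarter 1..4) to a running quarter number."""
    return year * 4 + (quarter - 1)


def crawl_chart(returns, shipped, horizon):
    """Cumulative PPM per production quarter versus quarters after production.

    returns: iterable of (production_quarter, complaint_quarter) pairs,
             each quarter given as a (year, quarter) tuple
    shipped: dict mapping production_quarter -> number of units delivered
    horizon: last quarter-after-production to tabulate
    """
    counts = defaultdict(lambda: [0] * (horizon + 1))
    for prod_q, complaint_q in returns:
        age = quarter_index(*complaint_q) - quarter_index(*prod_q)
        if prod_q in shipped and 0 <= age <= horizon:
            counts[prod_q][age] += 1

    chart = {}
    for prod_q, units in shipped.items():
        running, curve = 0, []
        for n in counts[prod_q]:
            running += n                      # cumulative number of returns
            curve.append(1e6 * running / units)  # expressed in PPM
        chart[prod_q] = curve
    return chart


def extrapolate_saturation(young_curve, reference_curve):
    """Scale a young curve to the end of a mature reference curve.

    Assumes both curves share one shape and differ only in saturation level.
    """
    k = len(young_curve) - 1
    if reference_curve[k] <= 0:
        raise ValueError("reference curve has no returns yet at this age")
    return young_curve[-1] * reference_curve[-1] / reference_curve[k]


# Hypothetical example data (not taken from the paper).
shipped = {(2005, 1): 2_000_000, (2005, 2): 2_000_000}
returns = [
    ((2005, 1), (2005, 2)),
    ((2005, 1), (2005, 3)),
    ((2005, 1), (2006, 1)),
    ((2005, 2), (2005, 3)),
]
chart = crawl_chart(returns, shipped, horizon=4)

# Predict where the younger quarter may end up, using only its first two points.
predicted = extrapolate_saturation(chart[(2005, 2)][:2], chart[(2005, 1)])
```

Plotting each list in `chart` against its index gives one curve per production quarter, so a bad production quarter stands out as a curve that climbs to a higher level than its neighbours.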

A product fails when one component fails. It resembles a genuine weakest-link situation. Therefore it is expected that the failure rate obeys Weibull statistics. It is also expected that the failure rate is a decreasing function. However, practice is more complicated than this. In Fig. 3 the failure rate, calculated as the added PPM per quarter since the production quarter, is plotted as a function of the quarter after production. One sees that the failure rate is a decreasing function and that the shape is constant. However, there are two deviations to mention. First, the production quarter itself has a very limited number of returns. Second, the failure rate is not only decreasing, it even approaches zero. When one takes these numbers and transforms them into Weibull graphs, as shown in Fig. 4, it is clearly seen that there is a deviation from the expected straight line, towards a saturation level. The fact that the first quarter is deviant has to do with the variable amount of time between final production at the IC factory and assembly at the module maker. After the module maker there is another waiting time until the module is assembled in a car.
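The Weibull transformation referred to above can be sketched as follows. For a Weibull distribution, F(t) = 1 - exp(-(t/η)^β), so plotting ln(-ln(1-F)) against ln(t) gives a straight line with slope β, and β < 1 corresponds to a decreasing failure rate. The code below is a minimal sketch: the function names are hypothetical, and the data are synthetic, generated from assumed parameters purely to check the transformation. They are not the return data of the paper.

```python
import math


def weibull_coordinates(ages, cum_fraction):
    """Return (ln t, ln(-ln(1 - F))) pairs for a Weibull plot.

    Points with t <= 0 or F outside (0, 1) cannot be plotted and are skipped.
    """
    xs, ys = [], []
    for t, f in zip(ages, cum_fraction):
        if t > 0 and 0.0 < f < 1.0:
            xs.append(math.log(t))
            # log1p keeps precision when F is at PPM level
            ys.append(math.log(-math.log1p(-f)))
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic check with assumed (hypothetical) Weibull parameters.
TRUE_BETA = 0.5   # shape < 1: decreasing failure rate
TRUE_ETA = 1e12   # scale in quarters, chosen so that F stays at PPM level
ages = list(range(1, 9))  # quarters after production
fractions = [-math.expm1(-((t / TRUE_ETA) ** TRUE_BETA)) for t in ages]

xs, ys = weibull_coordinates(ages, fractions)
beta, intercept = fit_line(xs, ys)  # slope estimates the Weibull shape
```

On ideal Weibull data the fitted slope recovers the shape parameter. Return data that saturate would instead bend away from the fitted line at large ages, which is the kind of deviation described for Fig. 4.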

Fig. 2. Quality crawl chart: evolution of the number of customer complaints (expressed in PPM) in time for a number of production quarters of a typical high volume product. (x-axis: quarter after production quarter, 0 to 16.)

Fig. 3. Failure rate in time, defined as added PPM in a quarter, normalized to its highest value. (Axes: added PPM, 0 to 100%, versus quarter after production.)

Fig. 4. Weibull graph of the number of customer returns for the same quarters as shown in Fig. 3. (Axes: ln(-ln(1-F)) [a.u.] versus quarter after production, logarithmic scale.)

The time delay makes especially the first quarter unreliable for a Weibull plot. The fraction of fails is more or less stable, so it can be used for extrapolations. The apparent saturation can be explained in various ways. One explanation could be that the number of extrinsic parts is limited and that after some time these have been consumed. Another effect is that the biggest stress a device experiences occurs within the first part of its life. Once in a car, the stress-induced damage causes some devices to fail, but the stress level itself is very low, hence the trend towards zero. Interestingly, this is in contrast with [1], in which consumer electronics WCRs are studied. A third aspect is the warranty assessment. It is striking how well behaved and similar the three curves are. It can therefore be concluded that, even though the curves suggest that Weibull is not the best way to interpret them, quality crawl charts can be used very well to analyse and predict incoming PPM levels. In this way the otherwise lagging parameter, customer returns, can be made much more leading, which enables accurate prediction of production PPM levels after only a few months. This enables faster response to PPM signals.

3. Continuous EOS/ESD return reduction

Historically, the number of EOS/ESD returns accounts for 30–50% of integrated circuit customer returns. We see that in recent years the fraction of EOS/ESD returns is coming down to 30–40%. Apparently, next to all improvements that have been introduced in IC manufacturing, there has been a very substantial improvement in the EOS/ESD category as well. All ICs are 100% functionally tested before they leave the factory. A returned device that is diagnosed to have EOS/ESD damage usually has a hard short between a pin and one of the supplies, which would always be detected. So the damaging EOS/ESD event does take place at the customer’s site. This is why EOS/ESD customer returns are typically categorized as non-justified complaints. However, the question is whether the stress at the customer site exceeds maximum sustainable ESD levels, or whether a minute fraction of the devices is more sensitive than others to the sometimes harsh conditions in assembly plants and cars, or has latent fails in the ESD protections. Our assumption is that both cases apply. In some cases devices are grossly maltreated; in other cases a combination of stress and enhanced sensitivity may play a role. Therefore we have applied a two-sided approach to reduce the EOS/ESD-specific PPM levels.

In the first place, the contribution of customers is determined by commonality analyses using our traceability system, with which it can quickly be checked whether certain customer locations cause more returns (in PPM) than others. If a customer location is identified as returning significantly more than the average PPM level, an investigation is started together with the customer to identify the suspect module, the suspect process step, and sometimes even the suspect car type or OEM production facility.

In the second place, there is a base level of customer returns originating from the whole customer base. We assume that this group contains the fraction of more sensitive products that have been stressed to levels above their (reduced) capability. If one wants to reduce this category, the question is therefore how one can screen deliveries for these sensitive products. As explained before [2,3], there are methods to identify slightly deviating products, for instance by making use of moving limits. For effective EOS/ESD screens, pins are identified that are typically seen as affected in EOS/ESD returns, and analogue tests are identified that would be influenced by an additional resistance path at the typical EOS/ESD location. Once such a test has been identified, extremely narrow guard bands can be used to identify the outliers and screen them out. A second possibility is to make use of symmetries in the product in the neighbourhood of pins that are typically found in customer returns. For instance, bus pins are much alike and can be used as an internal reference to each other, as illustrated in Fig. 5. Outliers can easily be identified and de-selected. Analysis of the outliers found in this way revealed that suspect locations are found exactly at the spot where customer returns are found to have failed, as indicated in Fig. 6. In the latter case the damage is of course much bigger. This finding suggests that the method is effective, which is also found in practice. As illustrated in Fig. 7, where EOS/ESD crawl charts of the first quarters of 4 years are shown, we have seen a reduction in the number of received customer complaints since the introduction of this kind of test. The EOS/ESD level has come down to a point where the PPM level is very much affected by individual customers with ESD-relevant application issues. This approach again proves that weak parts can effectively be screened out, in this case even for EOS/ESD returns.

Fig. 5. Correlation between two V(I) tests of similar pins to identify outliers. (Axes: V2 [V] versus V1 [V], both from -8 to -4.)

4. Small changes, big effects

In this section, an example is given of a small change that caused a big incident. Note that in automotive a single-digit PPM effect is already a big issue. In April 2006, a number of devices were returned with the complaint that the device could not start up after a long standby period. Analysis showed that in this particular IC, which had been used for a number of years in very high volumes, a gate is left floating when Vcc is off and Vbat is on, as explained in Fig. 8. When, in this situation (the so-called sleep mode), the gate becomes charged by a leakage current, it disables the wake-up circuit.
In fact, the Vbat line runs in the neighbourhood of the lines between the test pad and the gates of ENMOSTs 1 and 2. If a leakage path with a resistance on the order of a gigaohm is present between the two lines, it can cause such a charge-up of the gates of the ENMOSTs. When these ENMOST gates are charged, the transistors become conducting. The product then locks itself into the sleep mode, since the voltage pulses on the bus can no longer exceed the internal thresholds of the wake-up circuit. Until April 2006 this had never been an issue, but as the returns came in, we were looking at a level well above 1 PPM, which, with the high volumes involved, is a big problem. This is illustrated by the influx trend in Fig. 9, where it is seen that from April to June 2006 the influx was rising continuously. The accompanying quality crawl chart (made recently, so long after the problem was solved) is depicted in Fig. 10.
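For a sense of scale of the leakage mechanism described above, a rough back-of-envelope estimate may help. The sketch below applies Ohm's law to a gigaohm leakage path and treats the floating gate as the midpoint of a resistive divider. Only the gigaohm order of magnitude comes from the text. The 12 V battery voltage, the residual resistance from gate to ground, the divider model itself and all function names are assumptions made for illustration, not the author's analysis of the circuit.

```python
def leakage_current(v_supply, r_leak):
    """Ohm's law: current through a resistive leakage path, in amperes."""
    if r_leak <= 0:
        raise ValueError("resistance must be positive")
    return v_supply / r_leak


def floating_node_voltage(v_supply, r_leak, r_to_ground):
    """Steady-state voltage of a node that sits between a leak to the supply
    and a (hypothetical) residual leak to ground, modelled as a divider."""
    return v_supply * r_to_ground / (r_leak + r_to_ground)


V_BAT = 12.0        # V, assumed nominal car battery voltage
R_LEAK = 1e9        # ohm, the gigaohm order of magnitude named in the text
R_TO_GROUND = 1e11  # ohm, hypothetical intrinsic leakage of the floating gate

current = leakage_current(V_BAT, R_LEAK)
gate_voltage = floating_node_voltage(V_BAT, R_LEAK, R_TO_GROUND)
```

Under these assumed numbers the leak carries only about 12 nA, yet the floating node still settles close to the battery voltage because nothing else discharges it. This is one way to picture why a leakage path that would be harmless on a driven node can matter for a floating gate.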

Fig. 6. Visual inspection of outlier, which reveals deviating spot on location known from customer returns.

Fig. 7. Quality crawl charts of EOS/ESD returns of a high volume product. (Axes: returns [PPM] versus quarter after production; curves: 2004Q1, 2005Q1, 2006Q1, 2007Q1.)

Fig. 8. Wake-up circuitry and the Vbat line. Point A indicates the sensitive spot.

Fig. 9. Monthly influx (number of returns) of “non-wake-up” complaints. (Axes: number of returns [a.u.] versus month, 2006 to 2007.)

Fig. 10. Quality crawl charts of the design/process interaction category. (Axes: ppm versus receive quarter after production; curves: production quarters 2005Q1 to 2006Q4.)

Analysis of the production dates of the customer returns (remember: for a very high volume product a few PPM already give a lot of data to analyse) pointed in the first instance to a change in the measurement sequence in the test program.

It turns out that floating gates are very sensitive to the measurement sequence: a previous measurement can charge up the floating gate of a non-wake-up sensitive device, which makes it fail in a subsequent test. However, once the test sequence was changed, there was no such charging up, and therefore also less screening of this mechanism. The test program change did not fully match the timing of the increase of the wake-up phenomenon. It accounted for an increase, but not for the start. Detailed study of the onset of the issue and of process changes showed that the root cause was the introduction of a primer layer below the resist. The resist was known to delaminate occasionally, which causes line yield loss. The primer was introduced according to the normal change procedures, during which it was assessed to be a minor change: the product would not change in any way and the production would become more robust. However, reconstruction showed that with the introduction of the primer the probability that lacquer residues would redeposit on the surface increased considerably. These lacquer residues are very highly resistive and will only conduct nanoamps of current if they lie between two metal lines, but in this special case that was enough to make a PPM-level number of products lock up until the power was removed completely. This effect only affected this one product, as it was the only product sensitive to minute metal-to-metal leakage currents. The solution was, as usual, not difficult. Introduction of a dry strip in the process removes all lacquer residues. To safeguard against possible escapes, a dedicated wake-up test was also introduced, and finally a minor redesign was made available. No new non-wake-up cases have been recorded after introduction of the countermeasures. In my view this is an excellent example of how a very small change can have a PPM effect on a sensitive product, and it therefore illustrates the attention the automotive industry pays to changes.

5. Conclusions

Three elements of the battle towards zero defects have been illustrated in this paper. Quality crawl charts are introduced as a tool to change customer complaint data into a less lagging key performance indicator. Although returned fractions over time do not appear to follow a Weibull distribution, they are very well behaved, which allows for accurate extrapolations already after a limited time of return collection. It is also shown that an approach that combines customer selection and state-of-the-art screening techniques yields considerable improvements in EOS/ESD-related PPM numbers. The extreme caution of the automotive industry towards changes can be understood from the given example of the unforeseen effect of a small production change.

Acknowledgements

I would like to thank Mohammed Lemnawar and Marcel Rijnsburger of Business Line Automotive Safety and Comfort of NXP Semiconductors for numerous discussions and support.

References

[1] Ion RA, Petkova VT, Peeters BHJ, Sander PC. Field reliability prediction in consumer electronics using warranty data. Qual Reliab Eng Int 2007;23:401–14.
[2] Fang L, Lemnawar M, Xing Y. Cost effective outliers screening with moving limits and correlation testing for analogue IC’s. Proc Int Test Conf 2007:31.201–10.
[3] Xing Y. Defect-oriented testing of mixed-signal ICs: some industrial experience. Proc Int Test Conf 1998:678–87.