Defect spatial pattern recognition using a hybrid SOM–SVM approach in semiconductor manufacturing

Expert Systems with Applications 36 (2009) 374–385

Te-Sheng Li a, Cheng-Lung Huang b,*

a Department of Industrial Engineering and Management, Ming Hsin University of Science and Technology, Hsinchu, Taiwan, ROC
b Department of Information Management, National Kaohsiung First University of Science and Technology, 2, Juoyue Rd., Nantz District, Kaohsiung 811, Taiwan, ROC

* Corresponding author. Tel.: +886 7 6011000x4127; fax: +886 7 6011042. E-mail address: [email protected] (C.-L. Huang). doi:10.1016/j.eswa.2007.09.023

Abstract

As manufacturing geometries continue to shrink and circuit performance increases, fast fault detection and semiconductor yield improvement are of increasing concern. Circuits must be controlled to reduce parametric yield loss, and the resulting circuits tested to guarantee that they meet specifications. In this paper, a hybrid approach that integrates the Self-Organizing Map and the Support Vector Machine for wafer bin map classification is proposed. The log odds ratio test is employed as a spatial clustering measurement preprocessor to distinguish between systematic and random wafer bin map distributions. After a smoothing step is performed on the wafer bin map, features such as the co-occurrence matrix and moment invariants are extracted. The wafer bin maps are then clustered with the Self-Organizing Map using the aforementioned features, and the Support Vector Machine is applied to classify the wafer bin maps and identify the manufacturing defects. The proposed method can transform a large number of wafer bin maps into a small group of specific failure patterns and thus shorten the time and scope of troubleshooting for yield improvement. Real data on over 3000 wafers were applied to the proposed approach. The experimental results show that our approach achieves over 90% classification accuracy and outperforms the back-propagation neural network.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Wafer bin map; Defect diagnosis; Pattern classification; Semiconductor manufacturing; Self-organizing map; Support vector machine

1. Introduction

As semiconductor device density and wafer size continue to increase, the volume of in-line and off-line data required to diagnose yield conditions is growing exponentially (Tobin, Gleason, Lakhani, & Bennett, 1997). To manage these data in the future, analysis tools will be required that can automatically reduce them to useful information, assisting the process engineer in rapid root-cause defect diagnosis (Chien, Wang, & Cheng, 2007). High-volume wafer fabrication facilities typically produce thousands of wafers per week, and each wafer may have hundreds to several thousands of chips. During the manufacturing processes, the fabricator has a large variety


of tools available to provide wafer inspection and measurements, ranging from in-line optical defect detection to off-line scanning electron microscopy (SEM), energy dispersive X-ray spectroscopy (EDX), and focused ion beam (FIB) analysis for defect analysis and fault diagnosis. Upon exiting wafer fabrication, most companies perform electrical tests, often called probe tests. Regardless of the fabrication technology or wafer size, the probe operation creates and archives extremely large data sets, whose primary function is to determine product acceptance or rejection. Similar to the optical defect map, the wafer bin map in many cases contains characteristic patterns, or signatures, that give insight into the health of the manufacturing process. A bin can be viewed as a "bucket" classification into which all of the dies that meet that classification fall. The wafer bin map is created for viewing by mapping the results of these electrical tests onto a 2-D space.


Today, automatic defect classification (ADC), spatial signature analysis (SSA) and data mining approaches have demonstrated successful wafer bin map pattern recognition implementations. SSA is an automated procedure developed by researchers at the Oak Ridge National Laboratory to address the issue of intelligent data reduction while providing timely feedback on current manufacturing processes (Duvivier, 1999; Gleason, Tobin, & Karnowski, 1998; Karnowski, Tobin, Jensen, & Lakhani, 1999). This method has been extended to analyze and interpret electrical test data and provide a pathway for correlating these data with in-line measurements (Gleason, Tobin, Karnowski, & Lakhani, 1997). In the wafer bin map case, it has been demonstrated that a problem in the manufacturing process is often linked with a particular two-dimensional spatial pattern, or signature. For example, a collection of electrical failures around the outside edge of the wafer may indicate a spin coater problem, while a linearly oriented collection of failed dies may be indicative of a mechanical handling scratch imparted onto the wafer during processing. If these typical patterns can be automatically identified and linked to a specific manufacturing problem, the yield learning process will be accelerated by helping the field engineer quickly diagnose the problem.

The scope of this paper is to present an innovative wafer-map analysis able to effectively recognize the spatial patterns of clustered defects in a probe test wafer bin map. To classify the binary bin map defects, a hybrid system of a Self-Organizing Map (SOM) and a Support Vector Machine (SVM) is introduced, based on a spatial clustering measure, the log odds ratio (Taam & Hamada, 1993).

This paper is organized as follows. Section 2 presents the related literature on automatic defect classification and spatial signature recognition. The methodology of the proposed hybrid approach is briefly described in Section 3. The experimental results are explained in Section 4, including WBM spatial feature extraction, feature clustering and multi-class defect pattern recognition using the SOM and SVM. In Section 5 we draw our conclusions and indicate some directions for future research.

2. Literature review

Over the past 30 years, researchers have constructed various mathematical models for illustrating and predicting wafer yield. In most of these studies the yield models were based on the Poisson distribution and were developed to characterize the defect distribution as a function of device size and complexity, allowing reliable capacity and cost estimates to be calculated. Poor predictions of this type stemmed from the fact that defect densities are subject to considerable spatial and temporal variations, effects that have been broadly described as defect clustering. In this section, we briefly describe previous papers on defect clustering and classification for wafer bin map data.

Taam and Hamada (1993) proposed a log odds ratio to detect spatial effects in the integrated circuit manufacturing process using a measure of spatial dependence. Hansen, Nair, and Friedman (1997) illustrated that yields are an adequate measure if the defective ICs are distributed randomly both within and across the wafers in a lot. In practice, however, the defects often occur in clusters or display other systematic patterns. In general, these spatially clustered defects have assignable causes that can be traced to individual machines or to a series of process steps that did not meet specified requirements. They developed methods for routinely monitoring probe test data at the wafer-map level to detect the presence of spatial clustering; the statistical properties of a family of monitoring statistics were developed under various null and alternative situations of interest, and the resulting methodology was applied to manufacturing data. Wong (1996) proposed a methodology to separate the product yield into two major components: a non-random systematic yield and a random yield. This methodology is capable of determining the yield impact of sensitive parameters based on the calculated statistics of electrical tests, and is also capable of identifying the causes of parametric yield losses. The approach was successfully applied to linking yield with critical impact parameters, translating parametric problems into process control problems. Friedman, Hansen, Nair, and James (1997) used binary probe test data at the wafer level to estimate the size, shape and location of large-area defects or clusters of defective chips. This approach makes extensive use of the locations of failing chips to directly identify clusters; in addition, by directly estimating the spatial signature of clustered defects, it can provide engineers with a greater understanding of the process control issues associated with root-cause analysis. Lee, Song, and Sang (2001) applied data mining techniques by designing an intelligent in-line measurement sampling method for process parameter monitoring in a wafer fab. In order to effectively detect all process parameter abnormalities, this approach extracts spatial defect features from the historical wafer bin map data and then clusters similar defect features using the SOM neural network. Moreover, this approach merges homogeneous clusters using a statistical homogeneity test and then selects the chip locations with the best existing bin detection power by interactively exploring the SOM weight vectors. Chen and Liu (2000) employed a neural network architecture named the adaptive resonance theory network 1 (ART1) to detect wafer spatial defect patterns. They also used the results as clues for identifying equipment and process variations, and compared them with another unsupervised neural network, the self-organizing map (SOM), showing that this approach can correctly and effectively recognize similar defects. Skinner et al. (2002) compared various statistical methods that used probe test data to determine the cause of low-yield wafers. These methods included two traditional multivariate methods, clustering and principal components, as well as regression-based approaches. These traditional methods were compared to a classification and regression tree (CART) method. The results showed that CART adequately fitted the data and provided a favorable recipe for avoiding low yield, because CART is distribution-free and makes no assumptions about the data distribution properties. Miguelanez, Zalzala, and Tabor (2003, 2004) combined a genetic algorithm as a feature selector and an RBF neural network as a classifier for e-binmap defect classification. Several features were extracted from the test stage, including mass, moments and invariant moments; this method's performance reached up to an 87% correct e-binmap classification rate. In addition, they also employed a filtering algorithm to discard wafer maps without systematic patterns and then introduced the particle swarm optimization algorithm to build an RBF neural network to solve the defect classification problem. Palma, Nicolao, Miraglia, Pasquinetti, and Piccinini (2005) compared two unsupervised neural network classifiers, namely the SOM and ART1, to validate and recognize the spatial patterns on a wafer. They concluded that ART1 was not adequate, whereas the SOM provided completely satisfactory results, including a visually effective representation of the patterns' spatial probability classes.

3. Methodology

The methodology of this paper has been integrated into an intelligent system able to recognize clustered defect spatial patterns through spatial statistics, a self-organizing map (SOM) neural network and a support vector machine (SVM) for wafer bin map (WBM) pattern classification. The wafer bin map in many cases contains characteristics that give insight into the health of the semiconductor manufacturing process. Bin data describe the electrical testing results for individual wafer dies. A bin can be viewed as a "bucket" classification into which all of the dies that meet that classification fall. The wafer bin map is created by mapping the results from these electrical tests onto a 2-D space (Miguelanez et al., 2003). In the following, we briefly illustrate the methods employed in this hybrid approach.

3.1. A measure of spatial clustering

We employed the log odds ratio (Eq. (1)) as a measure of spatial clustering. This spatial measure can identify the spatial effects in the processes and parameters that influence the spatial clustering of functional or non-functional chips on the wafer map (Taam & Hamada, 1993). Other discussions of spatial statistics may be found in Palma et al. (2005). A wide range of bin classifications is employed during the testing process; we focus on the binary bin map. That is, the wafer bin map data are binary, where "1" indicates a good die (chip) and "0" indicates a bad die (chip). Thus, the dependences among dies can be measured using join-count statistics. The spatial dependencies for binary responses are given by N_GG = the number of joins among king-move neighbors that connect two good chips (represented as "G") on a wafer, N_GB = the number of joins among king-move neighbors that connect a good and a bad chip (represented as "B") on a wafer, N_BG = the number of joins among king-move neighbors that connect a bad and a good chip on a wafer, and N_BB = the number of joins among king-move neighbors that connect two bad chips on a wafer.

$$\hat{\theta} = \ln\left[\frac{(N_{GG}+0.5)(N_{BB}+0.5)}{(N_{GB}+0.5)(N_{BG}+0.5)}\right] \qquad (1)$$

The log odds ratio quantifies spatial dependence and independence; a 3 × 3 or 5 × 5 mask calculation, chosen in our pilot experiments, is used to smooth the spatial effects. A positive value indicates an attraction of like outcomes, whereas a negative value indicates a repulsion of them, and a value near 0 represents near independence (a random effect). Thus, the sign of $\hat{\theta}$ indicates the type of spatial dependence (attraction or repulsion) and its magnitude measures the degree of dependence. After separating the systematic pattern of bad chips, the remaining step extracts the pattern features through the co-occurrence matrix and invariant moments.
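As a concrete illustration of Eq. (1), the following is a minimal Python sketch that accumulates the join counts over the king-move (8-connected) neighborhood and returns the log odds ratio. It assumes the binary coding above (1 = good, 0 = bad, with cells outside the wafer marked -1); the function name and encoding are ours, not the paper's, and every unordered join is visited in both directions, which leaves the ratio unchanged.

import numpy as np

def log_odds_ratio(wafer):
    # wafer: 2-D integer array, 1 = good chip (G), 0 = bad chip (B);
    # cells outside the circular wafer area may be marked -1.
    rows, cols = wafer.shape
    counts = {"GG": 0, "GB": 0, "BG": 0, "BB": 0}
    # King-move (8-connected) neighbor offsets.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for r in range(rows):
        for c in range(cols):
            if wafer[r, c] < 0:
                continue
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and wafer[rr, cc] >= 0:
                    a = "G" if wafer[r, c] == 1 else "B"
                    b = "G" if wafer[rr, cc] == 1 else "B"
                    counts[a + b] += 1
    # Eq. (1), with the 0.5 continuity correction.
    return np.log((counts["GG"] + 0.5) * (counts["BB"] + 0.5)
                  / ((counts["GB"] + 0.5) * (counts["BG"] + 0.5)))

On a clustered map (bad dies attracting bad dies) this returns a clearly positive value, on a checkerboard-like map a negative one, and on an independently scattered map a value near 0, matching the interpretation given above.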

3.2. Feature extraction

Feature extraction is a procedure that computes new variables that, in one way or another, summarize the values stored in the wafer bin map. One of the main goals of this study is to generate, from a given wafer bin map, the features that will subsequently be fed to a classifier to assign the wafer bin map to one of the possible fault classes. These features should therefore efficiently encode the relevant information residing in the original data. In our problem, the features extracted from the wafer bin map for subsequent classification are the energy, entropy, contrast and local homogeneity extracted from the co-occurrence matrix, together with seven invariant moments. For a wafer bin map whose test value at coordinates x and y is f(x, y), the features are described below (Mudigonda, Rangayyan, & Desautels, 2000).

$$\text{(1) Entropy:}\quad f_E = -\sum_{i}^{N_g}\sum_{j}^{N_g} C(i,j;d,\theta)\,\log C(i,j;d,\theta) \qquad (2)$$

$$\text{(2) Energy:}\quad f_{En} = \sum_{i}^{N_g}\sum_{j}^{N_g} C^2(i,j;d,\theta) \qquad (3)$$

$$\text{(3) Contrast:}\quad f_{con} = \sum_{i}^{N_g}\sum_{j}^{N_g} (i-j)^2\, C(i,j;d,\theta) \qquad (4)$$

$$\text{(4) Local homogeneity:}\quad f_{LH} = \sum_{i}^{N_g}\sum_{j}^{N_g} \frac{1}{1+(i-j)^2}\, C(i,j;d,\theta) \qquad (5)$$

where C(i, j; d, θ) is defined as the probability of occurrence of a pair of grey levels (i, j) separated by a given distance d and angle θ.
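The co-occurrence features of Eqs. (2)-(5) can be computed directly from a normalized co-occurrence matrix. The following is a minimal sketch, assuming a small number of grey levels (two, for a binary bin map) and the single displacement (d = 1, θ = 0°) that Section 4.3 later adopts; all names here are illustrative rather than the paper's.

import numpy as np

def cooccurrence_features(image, levels=2, offset=(0, 1)):
    # image: 2-D integer array of grey levels 0..levels-1; a binary
    # wafer bin map has levels = 2.  offset (0, 1) realizes the
    # d = 1, theta = 0 setting adopted in Section 4.3.
    C = np.zeros((levels, levels))
    dr, dc = offset
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                C[image[r, c], image[rr, cc]] += 1
    C /= C.sum()                       # turn counts into probabilities
    i, j = np.indices(C.shape)
    nz = C > 0                         # skip log(0) terms in the entropy
    entropy = -np.sum(C[nz] * np.log(C[nz]))        # Eq. (2)
    energy = np.sum(C ** 2)                          # Eq. (3)
    contrast = np.sum((i - j) ** 2 * C)              # Eq. (4)
    homogeneity = np.sum(C / (1.0 + (i - j) ** 2))   # Eq. (5)
    return entropy, energy, contrast, homogeneity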


$$\text{(5) Mass:}\quad M = \frac{\sum_x \sum_y f(x,y)}{N} \qquad (6)$$

$$\text{(6) Centroid:}\quad X_c = \frac{\sum_x \sum_y x\, f(x,y)}{\sum_x \sum_y f(x,y)}, \qquad Y_c = \frac{\sum_x \sum_y y\, f(x,y)}{\sum_x \sum_y f(x,y)} \qquad (7)$$

$$\text{(7) Geometric moments:}\quad m_{pq} = \frac{\sum_x \sum_y x^p y^q f(x,y)}{\sum_x \sum_y f(x,y)} \qquad (8)$$

$$\text{(8) Central moments:}\quad \eta_{pq} = \sum_x \sum_y (x - x_c)^p (y - y_c)^q f(x,y) \qquad (9)$$

In this study, the seven invariant moments of Hu (1962), which are invariant under translation, scaling and rotation, are also considered:

$$\begin{aligned}
\Phi_1 &= \eta_{20} + \eta_{02}\\
\Phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2\\
\Phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2\\
\Phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2\\
\Phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]\\
\Phi_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})\\
\Phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned} \qquad (10)$$

where the η_pq are the central moments defined in Eq. (9). The first six of these moments are also invariant under reflection, while the seventh changes sign. The values of these quantities can be very large and of very different magnitudes; in practice, to avoid precision problems, the logarithms of their absolute values may be taken and passed to the classifier as features. The invariance of these features is an advantage when wafer bin maps are analyzed with signature classes that do not depend on scale, location, or angular position.
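The moment features of Eqs. (6)-(10) can be sketched as follows. The code uses the standard Hu forms given above and applies the signed log-compression suggested in the text; the function name and array conventions are ours.

import numpy as np

def hu_moments(f):
    # f: 2-D non-negative map f(x, y); the seven Hu invariants of
    # Eq. (10), computed from the central moments of Eq. (9).
    x, y = np.indices(f.shape).astype(float)
    m = f.sum()                                       # mass, cf. Eq. (6)
    xc, yc = (x * f).sum() / m, (y * f).sum() / m     # centroid, Eq. (7)

    def eta(p, q):                                    # central moment, Eq. (9)
        return ((x - xc) ** p * (y - yc) ** q * f).sum()

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)

    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    phi3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    phi4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    phi5 = ((n30 - 3 * n12) * (n30 + n12)
            * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            + (3 * n21 - n03) * (n21 + n03)
            * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    phi6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
            + 4 * n11 * (n30 + n12) * (n21 + n03))
    phi7 = ((3 * n21 - n03) * (n30 + n12)
            * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            - (n30 - 3 * n12) * (n21 + n03)
            * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    phis = np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])
    # Log-compress the magnitudes to avoid precision problems, as the
    # text suggests; the sign is kept so phi7 can still flip on mirroring.
    return np.sign(phis) * np.log1p(np.abs(phis))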

3.3. Unsupervised neural network: SOM

Neural networks are known for their predictive capability and their ability to learn patterns from real data that are noisy, imprecise and incomplete. Among the many neural network models available, the SOM is the one most suitable for unsupervised applications because it has the special property of effectively creating spatially organized "internal representations" of various input data features and provides a topology-preserving mapping from the high-dimensional space onto two-dimensional grid maps (Lee et al., 2001). The SOM was developed by Kohonen (1989). Since then, many authors have studied its application and theory, and it has demonstrated its efficiency in various engineering problems (Kohonen, Raivio, Simula, Venta, & Henriksson, 1990), including clustering, pattern recognition, dimensionality reduction and feature extraction. In addition, the SOM has been applied in the visualization of complex processes and systems and in discovering dependencies and abstractions from raw data in semiconductor manufacturing. SOM applications to the testing phase in the back-end section, related to wafer bin map sampling design, clustering, and defective pattern recognition, have been under intense investigation during the past decades (Lee et al., 2001).

3.4. Support vector machines

The supervised neural network has the task of discriminating over the regions of the state space where the class decision boundaries are complex or ambiguous. The network should be of a local approximation type and should incorporate formalism in its design for obtaining adequate generalization performance. One of the neural network models that fulfills these requirements is the support vector machine (SVM). The SOM prototype vectors create piecewise linear class boundaries (Kohonen, 1997) that are usually not effective for resolving class ambiguity over all regions of the state space. Moreover, even if the learning procedure enlarges the SOM adaptively until each of its neurons unambiguously represents a single class, this solution addresses only the minimization of the training error and ignores the generalization performance. In the absence of a formal setting for designing for generalization, the decision boundaries that the SOM constructs for the ambiguous regions are not expected to cope well with discriminating new patterns. On the contrary, an SVM implementation of supervised learning offers the potential to construct near perfect decision boundaries. Below we discuss the supervised network implementation with BP and with SVM, emphasizing the latter choice, which provides better generalization performance and a more disciplined design. The SVM obtains high generalization performance without the need to add a priori knowledge, even when the dimension of the input space is high. Moreover, it is a model that allows a more accurate formal assessment of the generalization performance, which fits well within the framework of this proposal. Below we attempt a rigorous assessment of the generalization performance of the supervised network, and we also provide a methodology for SVM model selection well fitted to the number of transferred patterns (i.e., model order selection). First, however, some notation and key concepts must be defined.

3.4.1. Optimal separating hyperplane

Let {(x_i, y_i), i = 1, ..., N} be a training example set S; each example x_i ∈ R^n belongs to a class labeled by y_i ∈ {−1, 1}. The goal is to define a hyperplane which divides S such that all the points with the same label are on the same side of the hyperplane while maximizing the distance between the two classes (see Fig. 1).

That is, we seek a pair (w, b) such that

$$y_i(w \cdot x_i + b) > 0, \qquad i = 1, \dots, N \qquad (11)$$

where w ∈ R^n and b ∈ R. The pair (w, b) defines a separating hyperplane

$$w \cdot x + b = 0. \qquad (12)$$

If there exists a hyperplane satisfying (11), the set S is said to be linearly separable, and we can rescale w and b so that

$$y_i(w \cdot x_i + b) \geq 1, \qquad i = 1, \dots, N. \qquad (13)$$

Accordingly, the minimal distance between the closest point and the hyperplane is 1/||w||. Among these separating hyperplanes, the optimal separating hyperplane (OSH) is the one for which the distance to the closest point is maximal. Hence, to find the OSH we must minimize ||w||^2 under constraint (13). Since ||w||^2 is convex, we can minimize it under constraint (13) by means of the classical method of Lagrange multipliers. If we denote by α = (α_1, α_2, ..., α_N) the N non-negative Lagrange multipliers associated with constraint (13), the problem of finding the OSH is equivalent to maximizing the function

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j, \qquad (14)$$

where α_i ≥ 0, under the constraint $\sum_{i=1}^{N} y_i \alpha_i = 0$. Once the solution vector $\bar{\alpha} = (\bar{\alpha}_1, \bar{\alpha}_2, \dots, \bar{\alpha}_N)$ has been found, the OSH $(\bar{w}, \bar{b})$ has the following expansion:

$$\bar{w} = \sum_{i=1}^{N} \bar{\alpha}_i y_i x_i, \qquad (15)$$

while $\bar{b}$ can be determined from $\bar{\alpha}$ and the Kuhn–Tucker conditions (Bertsekas, 1989)

$$\bar{\alpha}_i \left( y_i (\bar{w} \cdot x_i + \bar{b}) - 1 \right) = 0, \qquad i = 1, 2, \dots, N. \qquad (16)$$

The training examples (x_i, y_i) with non-zero coefficients $\bar{\alpha}_i$ are called support vectors. Finally, the decision function for classifying a new data point x can be written as

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \bar{\alpha}_i y_i\, x_i \cdot x + \bar{b} \right). \qquad (17)$$

Fig. 1. Classification by an SVM. Separating hyperplanes for (a) the linearly separable case and (b) the linearly non-separable case. The dashed lines identify the margin, and the support vectors lie on the two planes P1 and P2. The optimal separating hyperplane lies between, and parallel to, P1 and P2.

3.4.2. Linearly non-separable case

If the set S is not linearly separable, we must introduce N non-negative variables ξ = (ξ_1, ξ_2, ..., ξ_N) such that

$$y_i(w \cdot x_i + b) \geq 1 - \xi_i, \qquad i = 1, 2, \dots, N. \qquad (18)$$

These variables are called slack variables; their purpose is to allow misclassified points, which have ξ_i > 1. Hence, the generalized OSH is regarded as the solution of the following minimization problem:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{N} \xi_i, \qquad (19)$$

where C is a regularization parameter. When the parameter C is small, the OSH tends to maximize the distance 1/||w||; on the contrary, a larger C leads the OSH to minimize the number of misclassified points (see Fig. 1).

3.4.3. Nonlinear support vector machines

For a linearly non-separable training set, the input data can first be mapped into a high-dimensional feature space, and the OSH is then constructed in that feature space. If Φ(x) denotes the function that maps x into the high-dimensional feature space, the dual problem becomes

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, \Phi(x_i) \cdot \Phi(x_j). \qquad (20)$$

Now, letting K(x_i, x_j) = Φ(x_i) · Φ(x_j), we can rewrite the above equation as

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j), \qquad (21)$$

where K is called a kernel function and must satisfy Mercer's theorem (Vapnik, 1995). Finally, the decision function becomes

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \bar{\alpha}_i y_i\, K(x_i, x) + \bar{b} \right). \qquad (22)$$
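As a hedged illustration of Sections 3.4.1-3.4.2, the sketch below fits a soft-margin linear SVM on toy data with scikit-learn's SVC (a wrapper around LIBSVM, the library used later in Section 4.5); the attributes coef_, intercept_ and support_vectors_ expose the w-bar, b-bar and support vectors of Eqs. (15)-(17). The data and parameter values are illustrative only, not the paper's.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels in {-1, +1}

clf = SVC(kernel="linear", C=10.0).fit(X, y)
w = clf.coef_[0]                 # w-bar of Eq. (15)
b = clf.intercept_[0]            # b-bar from the Kuhn-Tucker conditions
sv = clf.support_vectors_        # training points with non-zero alpha_i

# Eq. (17): sign(w . x + b) reproduces clf.predict on these data.
pred = np.sign(X @ w + b)
print(np.mean(pred == clf.predict(X)))   # expected ~1.0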


Typical kernel functions are the following:

Linear kernel:
$$K(x, z) = x \cdot z \qquad (23)$$

Polynomial kernel:
$$K(x, z) = (\gamma\, x \cdot z + coef)^d \qquad (24)$$

where γ and coef are constants and d is the degree.

Gaussian radial basis kernel:
$$K(x, z) = \exp(-\gamma\, |x - z|^2) \qquad (25)$$

where γ is a constant.

Sigmoidal neural network kernel:
$$K(x, z) = \tanh(\gamma\, x \cdot z + coef) \qquad (26)$$

where γ and coef are constants.

3.4.4. Multiclass support vector machines

Two main approaches have been suggested for applying SVMs to multiclass classification. One is the one-against-all strategy, which classifies between each class and all the remaining ones; the other is the one-against-one strategy, which classifies between each pair of classes. In each, the underlying idea is to reduce the multiclass problem to a set of binary problems, enabling the basic SVM approach to be used. In the one-against-all approach, a set of binary classifiers, each trained to separate one class from the rest, is constructed, and the input vector is allocated to the class for which the largest decision value is obtained (Hsu & Lin, 2002). The ith SVM is trained on the training samples with the examples contained in the ith class given "+1" labels and the examples contained in the other classes given "−1" labels. Specifically, with this approach, where n is the number of classes, there are n decision functions

$$(w^i)^T \Phi(x) + b^i, \qquad i = 1, \dots, n. \qquad (27)$$

The data point x then belongs to the class for which the decision function has the largest value, i.e.,

$$\text{class of } x \equiv \arg\max_{i=1,\dots,n} \left( (w^i)^T \Phi(x) + b^i \right). \qquad (28)$$

The second method of reducing a multiclass problem to a series of binary ones, enabling the application of the basic SVM model, is the one-against-one approach. In this approach, a classifier is trained for each pair of classes, and the most commonly predicted class label is kept for each input. The application of this method requires that n(n−1)/2 classifiers, or machines, be applied to each pair of classes, together with a strategy to handle instances in which an equal number of votes is derived for more than one class for a pattern (Hsu & Lin, 2002). Once all n(n−1)/2 classifiers have been evaluated, the max-win strategy is followed. Specifically, if sgn((w^jl)^T Φ(x) + b^jl) assigns x to the jth class, then the vote for the jth class is incremented by one; otherwise that for the lth class is increased by one. Finally, the data vector x is predicted to belong to the class with the maximum number of votes.

Fig. 2 shows the wafer bin map pattern recognition architecture. The features of specific patterned samples are extracted and selected as the inputs of the hybrid (SOM and SVM) approach for the training and testing samples. A more detailed demonstration of the proposed hybrid SOM–SVM approach can be found in the following experimental section.
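Both decompositions of Section 3.4.4 are available off the shelf. The sketch below uses scikit-learn's OneVsRestClassifier and OneVsOneClassifier around an RBF-kernel SVC (the Gaussian kernel of Eq. (25)); the synthetic data merely stand in for the 11 wafer-map features and 7 SOM clusters used later, and the parameter values are illustrative, not the paper's.

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in: 125 lots, 11 features, 7 classes.
X, y = make_classification(n_samples=125, n_features=11, n_informative=8,
                           n_classes=7, random_state=1)

base = SVC(kernel="rbf", C=4.0, gamma=0.25)     # Gaussian kernel, Eq. (25)
ova = OneVsRestClassifier(base).fit(X, y)       # n binary machines, Eq. (27)
ovo = OneVsOneClassifier(base).fit(X, y)        # n(n-1)/2 machines, max-win vote
print(ova.score(X, y), ovo.score(X, y))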

4. Experimental results

4.1. Data sources

The data were collected during a six-month period from July 2004 to December 2004 for a logic IC product produced by a well-known semiconductor company in the Hsin-Chu industrial park, Taiwan. The main source data consisted of 3573 wafers with 751 dies each, passed through the circuit probing process, which describes the electrical testing results for each individual die on a wafer. The wafer bin map is created by mapping the results from these electrical tests onto a 2-D space, and it contains many characteristics that give insight into the health of the semiconductor manufacturing process. Typically, semiconductor products are manufactured on a lot-by-lot basis because each wafer in a lot has identical manufacturing parameter settings and environmental conditions. We therefore performed a pre-processing operation on the electrical test data to determine which positions should be indicated as good chips (labeled as 0) or bad chips (labeled as 1). Based on several years of experience, the process engineer set the threshold value at 0.16, which means that a given position in a lot (25 wafers) is labeled "bad" when more than 4 of its 25 chips are bad (4/25 = 0.16). Thus, we employed the stacked wafer as the analysis unit; that is, the 3573 wafer bin maps were transformed into 140 stacked wafer bin maps as the analysis base, owing to the lot-by-lot basis of the semiconductor manufacturing environment.

4.2. Log odds ratio test

We first employed the log odds ratio as a measure of spatial clustering, which quantifies the spatial dependence or independence in terms of the corresponding join-count statistics. The log odds ratio estimate divides the spatial dependence into three groups: attraction with a positive value, repulsion with a negative value, and near independence with a value near 0. Although the sign of the measure indicates the type of spatial dependence (attraction or repulsion), its magnitude also measures the degree of dependence. The results show that there were no negative or near-zero values after the calculation was completed. In other words, all of the measures were positive values (from 0.5 to 1.9), indicating spatial clustering of the same kind of chips. In order to distinguish this situation, we used a p-value equal to 10^−5 as the significance level to test the clustering significance. As a result, 125 of the 140 wafer lots showed the spatial clustering condition. To effectively distinguish the wafer-map characteristics, we also conducted a filtering process and reduced the wafer-map noise.
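A minimal sketch of the lot-stacking step of Section 4.1, assuming the 1 = bad / 0 = good labeling used there; the function name and array shape are illustrative, not from the paper.

import numpy as np

def stack_lot(lot_maps, threshold=0.16):
    # lot_maps: array of shape (25, rows, cols), one binary map per
    # wafer in the lot, with 1 = bad chip and 0 = good chip.  A die
    # position in the stacked map is marked bad when its failure rate
    # over the lot exceeds the engineer-set threshold of 0.16, i.e.
    # more than 4 of 25 wafers fail at that position.
    fail_rate = np.asarray(lot_maps).mean(axis=0)
    return (fail_rate > threshold).astype(int)

# Applied lot-by-lot, the 3573 raw wafer maps collapse to the
# 140 stacked maps analyzed in the rest of Section 4.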

Fig. 2. Flow chart of WBM classification: problem description → data preprocessing → calculation of spatial statistics → systematic versus random pattern decision → feature extraction and feature enhancement/noise deletion for clustered defects → SOM clustering → SVM (or BP) classification → classification results, with the wafer bin map database feeding the process.


4.3. Feature selection

The features used in this study were selected from the co-occurrence matrix and the 2-dimensional invariant moments; eleven features were selected in total. The 2-D invariant moments do not require any related parameters, whereas the four features selected from the co-occurrence matrix depend on the interval (number of pixels) and the angle. The operating characteristic curves for these four features are very similar to each other in terms of the interval and angle, so the number of pixels was set to d = 1 and the angle was set to 0 degrees. Table 1 shows the feature values for three wafer-map images, map 1 through map 3. The contrast and entropy values increase with the degree of wafer-map complication, while the energy and local homogeneity values decrease with it. Thus, maximizing the contrast and entropy values and minimizing the energy and local homogeneity values better demonstrates the feature characteristics.

Table 1
Co-occurrence matrix example

             Energy   Contrast   Entropy   Local homogeneity
Wafermap-1   0.12     1.44       3.411     1.72
Wafermap-2   0.11     2.31       3.417     1.69
Wafermap-3   0.07     2.99       3.844     1.63

4.4. SOM clustering

The choice of the right number of classes in an unsupervised network is an open problem. In fact, increasing the number of clusters will usually lead to an improvement in the chosen measure due to the identification of a new cluster or a split in a true cluster (Palma et al., 2005). Herein, we combine the unsupervised SOM neural network with the SVM classifier for validation. The input data for training the SOM were the 125 wafer-map lots with 11 features, and the 2-D feature map output was set to 3 × 3 and 4 × 4 dimensions. The SOM parameters were set as follows: iterations: 10,000; initial learning rate: 1; learning rate: 0.9; radius of neighborhood: 0.9; minimum radius: 1. Table 2 gives a summary of the clustering results for the 3 × 3 SOM feature map.
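As a sketch of how this clustering step can be reproduced, the code below trains a 3 × 3 SOM with the third-party minisom package (assumed available; the paper does not name its SOM implementation). Its sigma and learning_rate arguments are only rough analogues of the neighborhood radius and learning-rate settings quoted above, and the random matrix stands in for the real 125 × 11 feature matrix.

import numpy as np
from minisom import MiniSom  # third-party package, assumed available

features = np.random.rand(125, 11)      # stand-in for the real features

som = MiniSom(3, 3, 11, sigma=0.9, learning_rate=0.9, random_seed=7)
som.train_random(features, 10000)       # 10,000 iterations, as in the text

# Each lot is assigned to its winning neuron; the neuron index serves
# as the cluster label later fed to the SVM stage.
labels = [som.winner(x) for x in features]
print(labels[:5])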

Table 2
Feature map 3 × 3 clustering distribution

Cluster   # of lots   Characteristic of clustering
1         10          Edge effect; up, bottom, left and right
2         2           Edge effect; bottom and up
3         10          Semi-ring effect; left to up
4         5           Edge effect; up, left and right
5         30          3/4 ring effect
6         23          Semi-ring effect
7         45          Ring effect
Total     125

There are 7 clusters illustrated in the table, together with the characteristics of each cluster. The two main patterns shown in Fig. 3 and Table 2 are the edge and ring effects; the percentage of ring effects, over 86% (108/125), is much greater than the percentage of edge effects, 14% (17/125). Thus, it is important to reveal this information to the process engineer for diagnosis. Table 3 summarizes the clustering results for the 4 × 4 SOM feature map. There are fourteen clusters shown in the table, with the characteristics of each cluster. The main patterns shown in Fig. 4 and Table 3 are the edge, ring, semi-ring and scratch effects. However, the patterns of clusters 5, 6, 8, 9 and 12 are similar to each other, all showing a semi-ring defect distribution, and the pattern of cluster 4 is almost the same as that of cluster 10. The 4 × 4 SOM clustering results are more detailed than those of the 3 × 3 SOM clustering: the clustering categories are divided into ring (27.2%), 3/4 ring (22.4%), semi-ring (20.8%), quarter ring (6.4%), edge (16%), and scratch effects (7.2%). Indeed, the 4 × 4 SOM clustering results can provide more detailed information to the process engineer for specific consideration.

4.5. SVM classification

In the previous section, we obtained two groups of clusters (i.e., 7 clusters and 14 clusters) created by the unsupervised SOM neural network. The support vector machines were constructed with the LIBSVM version 2.6 (Chang & Lin, 2001) software to validate the classification capability of the proposed hybrid SOM–SVM approach. Although the generalization ability of the SVM is relatively robust to variations in the parameter settings, the method used to define the kernel parameter γ and the cost parameter C for the RBF kernel function will ensure that high accuracy is obtained. Accordingly, the parameters were set in the ranges γ = [2^−4, 2^−3, ..., 2^10] and C = [2^−12, 2^−11, ..., 2^2], giving 15 × 15 combinations for testing the classification accuracy. These experiments were conducted using the holdout and k-fold cross-validation methods with k equal to 5 and 10.

4.5.1. One-against-all method

First, we conducted the one-against-all multi-class classification method using the holdout method. That is, all of the data were divided into a training data set with 84 records and a test data set with 41 records, based on the 11 features. Tables 4 and 5 present the performance for the 7 and 14 clusters obtained from the previous SOM clustering. Except for the accuracies listed for clusters 6 and 7 in Table 4, all were over 95%, with some reaching 100% classification accuracy under the different parameter settings. When we traced the wafer maps back to the original records, clusters 6 and 7 exhibited more defects and more inconsistent defect distributions around the edge than the other lots. We also conducted k-fold cross-validation to evaluate the performance of the proposed method, with k values of 5 and 10, as listed in Tables 6 and 7. The k-fold cross-validation performance was consistent for both k = 5 and k = 10 under 7 and 14 clusters, and the classification accuracies do not differ significantly across the numerous parameter settings. Generally, the classification performance using k = 10 is higher than that using k = 5.

Fig. 3. Wafer bin maps for the seven spatial clusters (panels #1–#7).
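A hedged sketch of the (C, γ) grid search with k-fold cross-validation described above, using scikit-learn's GridSearchCV instead of raw LIBSVM; the random features and labels stand in for the real feature vectors and their SOM cluster assignments.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(125, 11)             # stand-in for the 11 features
y = np.random.randint(0, 7, size=125)   # stand-in for 7 SOM cluster labels

# The 15 x 15 grid of Section 4.5.
param_grid = {"gamma": [2.0 ** e for e in range(-4, 11)],   # 2^-4 .. 2^10
              "C": [2.0 ** e for e in range(-12, 3)]}       # 2^-12 .. 2^2
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # k = 5 folds
search.fit(X, y)
print(search.best_params_, search.best_score_)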

Table 3
Feature map 4 × 4 clustering distribution

Cluster   # of lots   Characteristic of clustering
1         6           Quarter ring, up and left
2         2           Quarter ring, up and right
3         9           Edge effects
4         23          Ring effects
5         2           Semi-ring, up
6         12          Semi-ring, up
7         6           Edge effects, up and bottom
8         3           Semi-ring, up
9         3           Semi-ring, up
10        11          Ring effects
11        5           Edge effects, random
12        6           Semi-ring, up
13        9           Scratch effect, left to up
14        28          3/4 ring
Total     125

Fig. 4. Wafer bin maps for the 14 spatial clusters (panels #1–#14).

Table 4
One-against-all method results (holdout method, 7 clusters)

Cluster     C      γ     Accuracy (training) (%)   Accuracy (test) (%)
Cluster 1   2^10   2^1   100                       100
Cluster 2   2^9    2^0   100                       100
Cluster 3   2^9    2^1   100                       100
Cluster 4   2^9    2^0   100                       100
Cluster 5   2^6    2^1   100                       97.56
Cluster 6   2^12   2^0   100                       87.80
Cluster 7   2^12   2^0   100                       87.80

Table 6
One-against-all method results (cross-validation method, 7 clusters)

K   Cluster     Accuracy (%)   K    Cluster     Accuracy (%)
5   Cluster 1   94.40          10   Cluster 1   95.20
5   Cluster 2   100.00         10   Cluster 2   100.00
5   Cluster 3   96.00          10   Cluster 3   96.80
5   Cluster 4   94.40          10   Cluster 4   95.20
5   Cluster 5   92.80          10   Cluster 5   93.60
5   Cluster 6   95.20          10   Cluster 6   95.20
5   Cluster 7   96.00          10   Cluster 7   95.20

Table 5
One-against-all method (holdout method, 14 clusters)

Cluster      C     γ     Accuracy (training) (%)   Accuracy (test) (%)
Cluster 1    2^2   2^0   100                       100
Cluster 2    2^4   2^1   100                       100
Cluster 3    2^6   2^1   100                       97.56
Cluster 4    2^7   2^5   100                       97.56
Cluster 5    2^6   2^1   100                       100
Cluster 6    2^4   2^2   100                       95.12
Cluster 7    2^2   2^0   100                       100
Cluster 8    2^2   2^0   100                       100
Cluster 9    2^2   2^0   100                       100
Cluster 10   2^1   2^0   97.62                     95.12
Cluster 11   2^2   2^0   100                       100
Cluster 12   2^2   2^0   100                       100
Cluster 13   2^2   2^0   100                       100
Cluster 14   2^0   2^3   100                       97.56

Table 7
One-against-all method (cross-validation method, 14 clusters)

K   Cluster      Accuracy (%)   K    Cluster      Accuracy (%)
5   Cluster 1    100            10   Cluster 1    100
5   Cluster 2    99             10   Cluster 2    99.20
5   Cluster 3    94             10   Cluster 3    94.40
5   Cluster 4    97             10   Cluster 4    96.80
5   Cluster 5    99             10   Cluster 5    99.20
5   Cluster 6    95             10   Cluster 6    96
5   Cluster 7    100            10   Cluster 7    100
5   Cluster 8    98             10   Cluster 8    98.40
5   Cluster 9    99             10   Cluster 9    99.20
5   Cluster 10   95             10   Cluster 10   96.80
5   Cluster 11   96             10   Cluster 11   96
5   Cluster 12   97             10   Cluster 12   96.80
5   Cluster 13   96             10   Cluster 13   95.20
5   Cluster 14   96             10   Cluster 14   96.80

4.5.2. One-against-one method

Secondly, we conducted the one-against-one multi-class classification method using the holdout method and k-fold cross-validation. As mentioned before, each cluster was divided into a 2/3 training data set and a 1/3 test data set with 11 features. The Gaussian kernel parameters used a γ set between 2^−4 and 2^11 and C between 2^−2 and 2^12. The classification accuracy of the one-against-one method reached 100% using 7 clusters. However, Table 8 presents the classification accuracies for 14 clusters, which did not reach 100% on the test data set. Tables 9 and 10 likewise show classification accuracies that did not reach 100% for k-fold cross-validation using 7 and 14 clusters. The k-fold cross-validation performance was consistent for both k = 5 and k = 10 under 7 and 14 clusters. Note that the lower-performance cases involved clusters containing fewer lots, which means that small training data sets may be unsuitable for SVM classification. However, the classification accuracies did not differ significantly across the numerous parameter settings. Generally, the classification performance using k = 10 is higher than that using k = 5.

Table 8
One-against-one method results (holdout method, 14 clusters)

Classes    C     γ     Accuracy (train) (%)   Accuracy (test) (%)
1 vs. 2    2^2   2^1   100                    50
1 vs. 8    2^7   2^3   100                    0
2 vs. 5    2^4   2^3   100                    0
2 vs. 7    2^1   2^1   100                    50
2 vs. 9    2^4   2^0   100                    0
5 vs. 7    2^8   2^4   100                    50
5 vs. 9    2^5   2^2   100                    0
7 vs. 8    2^5   2^1   100                    0
7 vs. 9    2^1   2^0   100                    50
7 vs. 11   2^0   2^2   100                    25
7 vs. 12   2^3   2^1   100                    50
9 vs. 11   2^3   2^2   100                    50

Table 9
One-against-one method (cross-validation method, 7 clusters)

K   Cluster pair   Accuracy (%)   K    Cluster pair   Accuracy (%)
5   1 vs. 3        90.00          10   1 vs. 3        90.00
5   1 vs. 5        97.50          10   1 vs. 5        97.50
5   1 vs. 6        69.70          10   1 vs. 6        66.67
5   1 vs. 7        89.09          10   1 vs. 7        92.73
5   3 vs. 5        90.00          10   3 vs. 5        97.50
5   3 vs. 7        96.36          10   3 vs. 7        96.36
5   4 vs. 6        96.43          10   4 vs. 6        96.43
5   4 vs. 7        98.00          10   4 vs. 7        98.00
5   5 vs. 6        90.57          10   5 vs. 6        90.57
5   5 vs. 7        98.67          10   5 vs. 7        98.67
5   6 vs. 7        97.06          10   6 vs. 7        97.06

Table 10
One-against-one method (cross-validation method, 14 clusters)

K   Cluster pair   Accuracy (%)   K    Cluster pair   Accuracy (%)
5   1 vs. 2        83.33          10   1 vs. 2        83.33
5   1 vs. 5        83.33          10   1 vs. 5        83.33
5   1 vs. 8        71.43          10   1 vs. 8        71.43
5   1 vs. 9        83.33          10   1 vs. 9        83.33
5   1 vs. 11       40.00          10   1 vs. 11       40.00
5   2 vs. 3        88.89          10   2 vs. 3        88.89
5   2 vs. 4        95.65          10   2 vs. 4        95.65
5   2 vs. 5        0.00           10   2 vs. 5        0.00
5   2 vs. 6        91.67          10   2 vs. 6        91.67
5   2 vs. 7        83.33          10   2 vs. 7        83.33
5   2 vs. 8        66.67          10   2 vs. 8        66.67
5   2 vs. 9        0.00           10   2 vs. 9        0.00
5   2 vs. 10       95.45          10   2 vs. 10       95.45
5   2 vs. 11       83.33          10   2 vs. 11       83.33
5   2 vs. 12       85.71          10   2 vs. 12       85.71
5   2 vs. 13       90.00          10   2 vs. 13       90.00
5   2 vs. 14       96.55          10   2 vs. 14       96.55
5   3 vs. 4        60.00          10   3 vs. 4        63.33
5   3 vs. 5        88.89          10   3 vs. 5        88.89
5   3 vs. 6        84.21          10   3 vs. 6        89.47
5   3 vs. 8        80.00          10   3 vs. 8        80.00
5   3 vs. 9        88.89          10   3 vs. 9        88.89
5   4 vs. 5        95.65          10   4 vs. 5        95.65
5   4 vs. 8        91.67          10   4 vs. 8        91.67
5   4 vs. 9        95.65          10   4 vs. 9        95.65
5   4 vs. 10       97.67          10   4 vs. 10       97.67
5   4 vs. 11       81.48          10   4 vs. 11       77.78
5   4 vs. 12       92.86          10   4 vs. 12       92.86
5   4 vs. 13       96.77          10   4 vs. 13       96.77
5   5 vs. 6        91.67          10   5 vs. 6        91.67
5   5 vs. 7        83.33          10   5 vs. 7        83.33
5   5 vs. 8        66.67          10   5 vs. 8        66.67
5   5 vs. 9        0.00           10   5 vs. 9        0.00
5   5 vs. 10       95.45          10   5 vs. 10       95.45
5   5 vs. 11       83.33          10   5 vs. 11       83.33
5   5 vs. 12       85.71          10   5 vs. 12       85.71
5   5 vs. 13       90.00          10   5 vs. 13       90.00
5   5 vs. 14       96.43          10   5 vs. 14       96.43
5   6 vs. 8        84.61          10   6 vs. 8        84.61
5   6 vs. 9        91.67          10   6 vs. 9        91.67
5   6 vs. 10       93.75          10   6 vs. 10       90.63
5   7 vs. 8        71.43          10   7 vs. 8        71.43
5   7 vs. 9        83.33          10   7 vs. 9        83.33
5   8 vs. 9        33.33          10   8 vs. 9        33.33
5   8 vs. 10       91.30          10   8 vs. 10       91.30
5   8 vs. 11       71.43          10   8 vs. 11       71.43
5   8 vs. 12       75.00          10   8 vs. 12       75.00
5   8 vs. 13       81.82          10   8 vs. 13       81.82
5   8 vs. 14       93.33          10   8 vs. 14       93.33
5   9 vs. 10       95.45          10   9 vs. 10       95.45
5   9 vs. 11       66.67          10   9 vs. 11       66.67
5   9 vs. 12       85.71          10   9 vs. 12       85.71
5   9 vs. 13       90.00          10   9 vs. 13       90.00
5   9 vs. 14       96.55          10   9 vs. 14       96.55
5   10 vs. 14      97.96          10   10 vs. 14      97.96
5   11 vs. 12      90.91          10   11 vs. 12      100
5   11 vs. 13      85.71          10   11 vs. 13      57.14
5   12 vs. 13      93.33          10   12 vs. 13      93.33
5   12 vs. 14      94.12          10   12 vs. 14      94.12
5   13 vs. 14      97.30          10   13 vs. 14      97.30

4.6. SOM–SVM comparison with SOM–BP

Table 11 displays the average classification accuracies of the three multiple-class classifiers, evaluated using the holdout method with 7 and 14 clusters. The SVM classifier, with either the one-against-one or the one-against-all strategy, yields better classification performance than the BP neural network. In this wafer-map defect classification case, we can claim that the SVM outperforms the BP neural network due to its higher generalization capability for classification. Accordingly, the strong point of the presented work is the framework that it provides for designing computationally efficient solutions.

Table 11
BP and SVM comparison

Method            7 clusters (Train)   7 clusters (Test)   14 clusters (Train)   14 clusters (Test)
One-against-one   100%                 100%                100%                  94.78%
One-against-all   100%                 96.17%              99.83%                98.79%
BP                86.90%               85.37%              52.38%                46.34%

5. Conclusion

This work proposed a novel hybrid approach combining the supervised SVM classifier with unsupervised SOM clustering for binary bin defect pattern classification. The model exploits the potential of the SOM combined with the SVM classifier to obtain higher classification accuracies. The first stage corresponds to the SOM feature map and can be performed roughly to determine the number of neurons, or clusters, in a two-dimensional SOM lattice. In the second stage, higher generalization performance should be enforced and explicitly designed for; supervised training networks capable of achieving good generalization performance (e.g., SVM and BP) are appropriate for this task. We developed the SOM combined with the SVM as a supervised classifier using the holdout and k-fold cross-validation methods, and compared this with the SOM combined with the BP supervised classifier. The results showed that the SOM combined with the SVM achieves higher classification performance using either the one-against-all or the one-against-one multi-classifier; moreover, the proposed model outperforms the SOM combined with the BP classifier. Note that the main objective of using the unsupervised SOM together with the supervised SVM classifier is to obtain significant computational benefits in large-scale problems: it is difficult to approach certain large problems directly with nearly optimal models such as the SVM, owing to the computational complexity of the numerical solution. This study focused only on binary bin map classification. Since a wide range of bin classifications is employed during the testing process, this bin categorization spectrum could motivate future research extending the proposed approach.

References

Bertsekas, D. P. (1989). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice-Hall.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Software available from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen, F. L., & Liu, S. F. (2000). A neural-network approach to recognize defect spatial pattern in semiconductor fabrication. IEEE Transactions on Semiconductor Manufacturing, 13(3), 366–373.
Chien, C.-F., Wang, W.-C., & Cheng, J.-C. (2007). Data mining for yield enhancement in semiconductor manufacturing and an empirical study. Expert Systems with Applications, 33(1), 192–198.
Duvivier, F. (1999). Automatic detection of spatial signature on wafermaps in a high volume production. In Proceedings of the 14th international symposium on defect and fault-tolerance in VLSI systems, Albuquerque, NM, USA (pp. 61–67).
Friedman, D. J., Hansen, M. H., Nair, V. N., & James, D. A. (1997). Model-free estimation of defect clustering in integrated circuit fabrication. IEEE Transactions on Semiconductor Manufacturing, 10(3), 344–359.
Gleason, S. S., Tobin, K. W., Karnowski, T. P., & Lakhani, F. (1997). Rapid yield learning through optical defect and electrical test analysis. In SPIE's international symposium on microlithography, Santa Clara Convention Center, Santa Clara, CA, USA.
Gleason, S. S., Tobin, K. W., & Karnowski, T. P. (1998). Rapid yield learning through optical defect and electrical test analysis. In SPIE's metrology, inspection, and process control for microlithography XII, Santa Clara Convention Center, Santa Clara, CA, USA.
Hansen, M. H., Nair, V. N., & Friedman, D. J. (1997). Monitoring wafer map data from integrated circuit fabrication processes for spatially clustered defects. Technometrics, 39(3), 241–253.
Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
Hu, M. K. (1962). Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2), 179–187.
Karnowski, T. P., Tobin, K. W., Jensen, D., & Lakhani, F. (1999). The application of spatial signature analysis to electrical test data: Validation study. In Metrology, inspection, and process control for microlithography XIII, Proceedings SPIE (Vol. 3677, pp. 530–541).
Kohonen, T. (1989). Self organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Kohonen, T. (1997). Self-organizing maps. New York: Springer-Verlag.
Kohonen, T., Raivio, K., Simula, O., Venta, O., & Henriksson, J. (1990). Combining linear equalization and self-organizing adaptation in dynamic discrete-signal detection. In Proceedings of the international joint conference on neural networks, San Diego, USA (Vol. 1, pp. 223–228).
Lee, J. H., Song, J. Y., & Sang, C. P. (2001). Design of intelligent data sampling methodology based on data mining. IEEE Transactions on Robotics and Automation, 17(5), 637–649.
Miguelanez, E., Zalzala, A. M. S., & Tabor, P. (2003). DIWA: Device independent wafermap analysis. In The congress on evolutionary computation, IEEE Computer Society, Canberra, Australia (Vol. 2, pp. 823–829).
Miguelanez, E., Zalzala, A. M. S., & Tabor, P. (2004). Evolving neural networks using swarm intelligence for binmap classification. In IEEE world congress on evolutionary computation, Portland, USA (Vol. 1, pp. 978–985).
Mudigonda, N. R., Rangayyan, R., & Desautels, J. E. L. (2000). Gradient and texture analysis for the classification of mammographic masses. IEEE Transactions on Medical Imaging, 19(10), 1032–1043.
Palma, F. D., Nicolao, G. D., Miraglia, G., Pasquinetti, E., & Piccinini, F. (2005). Unsupervised spatial pattern classification of electrical-wafer-sorting maps in semiconductor manufacturing. Pattern Recognition Letters, 26(12), 1857–1865.
Skinner, K. R., Montgomery, D. C., Runger, G. C., Fowler, J. W., McCarville, D. R., Rhoads, T. R., & Stanley, J. D. (2002). Multivariate statistical methods for modeling and analysis of wafer probe test data. IEEE Transactions on Semiconductor Manufacturing, 15(4), 523–530.
Taam, W., & Hamada, M. (1993). Detecting spatial effects from factorial experiments: An application from integrated-circuit manufacturing. Technometrics, 35(2), 149–160.
Tobin, K. W., Gleason, S. S., Lakhani, F., & Bennett, M. H. (1997). Automated analysis for rapid defect sourcing and yield learning. Future Fab International, Issue 4 (Vol. 1, p. 313). London: Technology Publishing Ltd.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wong, A. Y. (1996). A statistical parametric and probe yield analysis methodology. In Proceedings of the IEEE international symposium on defect and fault tolerance in VLSI systems, Boston, MA, USA (pp. 131–139).