Fault detection using bispectral features and one-class classifiers

Xian Du
Department of Mechanical and Industrial Engineering, University of Massachusetts, Amherst, MA 01002, USA

Journal of Process Control 83 (2019) 1–10


Article history: Received 2 February 2019; Received in revised form 29 July 2019; Accepted 23 August 2019

Keywords: Fault detection; Bispectrum; One-class classification; Process quality monitoring

Abstract

Smart manufacturing systems are equipped with various sensors that collect engineering variables for monitoring manufacturing process quality, for tasks such as fault detection. Fault detection, however, faces multiple fundamental difficulties: high nonlinearity, high dimensionality, unequal batch lengths, unsynchronized batch trajectories, and an imbalanced ratio of fault data to normal data in the sensor time-series recordings. To capture the dynamics, nonlinearity, and non-Gaussianity in the recordings, we design a feature extraction method using bispectral features and Principal Component Analysis (PCA); to address the class-imbalance issue, we use one-class classifiers. We evaluate the proposed method using recall, precision, G-means, and Area Under the ROC Curve (AUC) on a publicly available benchmark, a real semiconductor etching dataset for fault detection. The results demonstrate that bispectral features processed further by PCA can significantly improve the fault detection accuracy and lower the false alarm rate of one-class classifiers. The k-Nearest Neighbor (k-NN) classifier achieves a 100% fault detection rate and zero false alarms in the 10-fold cross-validation test. Published by Elsevier Ltd.

1. Introduction

Innovation and advancement in computers, internet technologies, and sensors have propelled the acquisition of massive amounts of machine data for manufacturing process monitoring and control. Finding a way to accurately detect faults using these multimodal and multidimensional time-series data is a necessary but challenging task. For data-driven fault detection in the industrial environment, more comprehensive and detailed studies are still needed to solve the nonlinear and non-Gaussian problems of large-scale industrial processes [18]. Univariate statistical techniques, such as Shewhart Statistical Process Control (SPC), Cumulative Sum (CUSUM), and Exponentially Weighted Moving Average (EWMA) charts, have long been used for monitoring manufacturing process quality. These methods are effective and easy to use for univariate manufacturing processes. However, for multivariate process data, they often cause many false alarms (type I errors) and undetected faults (type II errors) due to the highly correlated sensor measurements [1]. Moreover, most statistical methods assume that normal manufacturing processes have Gaussian and linear characteristics, while many manufacturing processes exhibit unequal batch and step lengths, unsynchronized batch trajectories, and non-Gaussian data distributions.

E-mail address: [email protected]. https://doi.org/10.1016/j.jprocont.2019.08.007

For example, the process operations in the semiconductor industry are governed by physical and chemical principles, such as mass and energy balances, which generally result in highly correlated sensor measurements [2,3]. These high-dimensional measurements can be nonlinear and non-Gaussian, possess multimodal batch trajectories, and have deliberately adjusted durations [4–6]. Various data preprocessing methods, including trajectory length alignment by trimming or warping, unfolding, and feature extraction, have been used to adapt the data to the monitoring methods. Recently, a variety of pattern classification techniques, such as multiway Principal Component Analysis (MPCA) and Support Vector Machine (SVM), have been modified to address the limitations of traditional fault detection techniques in manufacturing processes. These methods have proved very successful in continuous process applications such as petrochemical processes and traditional chemical batch processes. However, their linear and Gaussian assumptions hinder fault detection when these methods are applied to the nonlinearity present in most batch processes. Meanwhile, conventional data preprocessing methods simply unfold the time-series data into a 2-D data array before pattern classification can be applied. For example, one data preprocessing method simply blends the time dimension and the feature dimension, such that the dynamics and batch process information that characterize the variability of the process data are removed, and the fault detection rates on the unfolded data are limited. Moreover, conventional classifiers treat fault detection as a binary classification problem using both normal and fault labeled data in


Fig. 1. Flow chart of fault detection.

the training process, which ignores the imbalanced ratio between normal and fault data in the real process and hence can fail to identify some faults. Inspired by the solution to similar problems in the classification of EEG signals [7,8], we quantify the irregularity of the time series data of a manufacturing process for fault detection using measures of their regularity: the magnitude and entropy of the relative power over a frequency range. To capture the dynamics, nonlinearity, and non-Gaussianity in the recordings, we design a feature extraction method using two bispectral features and PCA; to address the class-imbalance issue, we use one-class classifiers. We find that the Principal Components (PCs) of the bispectral features can significantly improve the fault detection accuracy and lower the false alarm rate of one-class classifiers. The one-class k-NN classifier achieves the best fault detection rate and zero false alarms among our tested classifiers. As shown in Fig. 1, we follow a conventional pattern classification and recognition process for the data-driven fault detection. In Section 2, we present the state of the art in classification techniques used for fault detection in manufacturing processes, including data preprocessing, feature extraction, and one-class classifiers. In Section 3, we present a batch-wise unfolding method based on bispectral features. Section 4 describes the implementation details for the classification techniques used in our evaluation, as well as the evaluation criteria and the dataset. In Section 5 we present the experimental results. Finally, we discuss the results.

2. Related work

To deal with the imbalanced data in fault detection, researchers have designed and tested many one-class classification or self-learning techniques to improve the data-driven fault detection rate. The idea is to describe the normal manufacturing process using a one-class classifier; faulty data are then detected as outliers.

Wang et al. [4] proposed a self-supervised learning approach for fault detection. Their approach, based on kernel PCA, can detect faulty data by identifying a high reconstruction error in the feature space. Lv et al. [6] proposed an unsupervised deep multilayer neural network to characterize the higher-order correlation within faulty features through the composition of nonlinearities; these statistics were used to analyze the correlations hierarchically for fault identification. In the literature, the popular one-class classifiers tested for fault detection include MPCA [2], k-NN [9–11], Support Vector Data Description (SVDD) [4,12,13], and the Gaussian Mixture Model (GMM) [14,15]. In parallel with the development of these classifiers, a great number of data preprocessing and data transformation techniques have been proposed in recent years to improve classification performance. To solve the nonlinear fault detection problem, different kinds of kernel, such as the Gaussian kernel [4], have been proposed to apply a nonlinear transformation and map the input data into a feature space that allows the application of linear algorithms. Kernel-based PCA [4], PLS [16], and SVM [17] have been studied intensively to deal with nonlinearity and non-Gaussian distributed variables in industrial processes [18,19]. However, selecting an appropriate kernel and properly setting its parameters are important for a high-performance classifier, because the structure underlying the data is entirely determined by the kernel matrix. To adapt an unequal 3D multivariate time series dataset into a 2D input matrix for pattern classifiers, three conventional data unfolding methods have been designed in the literature [20]. Before unfolding, the sample durations (see Fig. 2(a)) are trimmed or warped to the same size (see Fig. 2(b)). The 3D data can be unfolded into 2D data by batch-wise unfolding, which combines time and variables into one dimension (see Fig. 2(d)), or by variable-wise unfolding, which fuses sample and time into one dimension (see Fig. 2(e)). Batch-wise unfolding is considered appropriate for data description methods under the assumption that it can capture most process characteristics along the sample trajectory; variable-wise unfolding is more appropriate when the number of samples is small, because it provides more training and testing data for data description modeling [13]. Batch-wise unfolding keeps batch-specific information separate from time and variable information.

Fig. 2. Three conventional data unfolding methods for batch process datasets in the literature. (a) 3D original batch data with unequal sample durations; (b) 3D array after trimming or warping; (c) unfolding result by statistics pattern analysis [21]; (d) batch-wise unfolding; (e) variable-wise unfolding. Notations: I – {sample i}; J – {variable j}; Ki – the length of sample i; K – trimmed duration.


It arranges the history of each variable together in the matrix. However, it dramatically increases the number of variables, such that reliable classifiers require enough batches to accurately characterize the process. Variable-wise unfolding orders the measurements taken at the same time together. However, this preprocessing loses the continuity and batch trajectory information. Additionally, both methods need to trim or warp the raw data to obtain sample records of equal length. He and Wang [21] proposed a statistical pattern analysis framework that integrates the variable and time dimensions into one dimension of statistics patterns by calculating various batch statistics. Their proposed statistical patterns include four groups of batch statistics: the mean, covariance matrix, skewness, and kurtosis of all variables. The four groups of statistics are assembled in one matrix with dimensions J × (J + 3). Vectorizing this matrix into a row vector for each sample, a 2D 'sample × statistics' data matrix of size I × (J × (J + 3)) can be used for batch pattern recognition. Using the symmetry of the covariance matrix, the size of the statistics pattern matrix of each sample can be reduced to J × (J + 7)/2. Though the authors claimed that the final statistics matrix is usually smaller than the unfolded data matrix used in the MPCA method, its high dimensionality is not reduced significantly compared to MPCA. Meanwhile, the statistics used in their research [21] do not capture the process dynamics, nonlinearity, and non-normality in the time and frequency domains underlying the time series data. The authors suggested tailoring their batch statistics with cumulants and polyspectra. However, tailoring the statistics would increase the size of the statistics matrix and exaggerate the high-dimensional data computation. Instead of tailoring the statistics matrix, we propose to use the irregularity features of each variable of the time series process data, described in the frequency domain, for pattern classification. Such irregularity features have been used successfully to measure the regularity of EEG time series signals for the quantification and classification of rhythmic and irregular EEG signals [7,8]. Similarly, frequency-domain features such as bispectral features have been used by many researchers in industrial condition monitoring and fault detection [22–25]. These features contain the dynamic information of a process, such as its deviation from Gaussianity and linearity, the phase estimation of non-Gaussian parametric signals, and the detection and characterization of the nonlinear properties of mechanisms that generate time series via phase relations of their harmonic components [26]. The third-order spectrum, also called the bispectrum, is the Fourier transform of the third-order cumulant sequence. As the simplest of the higher-order spectral features, the bispectrum has received the most attention in the literature as a tool for fault feature extraction, mainly because of its insensitivity to Gaussian noise, which is characteristic of most normal signals, and its ability to preserve phase information and detect quadratically phase-coupled signals. Many researchers [22,23,25] have applied bispectrum analysis to the fault detection of bearings. They demonstrated that bispectral features are effective for fault diagnosis in the bearings of induction motors, in gearboxes, and in flexible rotor systems. Tian et al. [24] extracted fault features using spectral kurtosis and cross correlation. The features were further combined into a health index using PCA and a semi-supervised k-NN distance measure. They demonstrated that the method can accurately detect incipient faults under masking noise by thresholding the health index of incoming data. Besides bispectral features, many other spectral features have been used in process monitoring and analysis. In machining process monitoring, the power spectral density is widely used for chatter detection [27]. It is well known that the chatter frequency occurs near a dominant structural frequency.


Thus, the most common approach to chatter detection is to investigate the spectral density of a process signal and develop a threshold value that indicates chatter. In machinery condition monitoring, spectral features have been extracted from signals including cutting force, acoustic emission (AE), vibration, and current. Among these spectral signal analyses, vibration-based fault diagnosis is a well-established field that includes a wide range of techniques that have evolved rapidly during the last decades. Conventional frequency features, including the mean, standard deviation, root mean square frequency, peak magnitude, energies, and ratios of spectral energies, are used for fault diagnosis in pumps, motors, and gearboxes [28]. Envelope analysis (EnA) is used especially in bearings to identify the bearing defect characteristic frequency. In vibration-based fault diagnosis, frequency features have been broadly used for wear detection and type identification based on the strong correlation between vibration frequency and tool wear states [29,30]. In addition to Fourier analysis, wavelet spectral analysis is also effective in analyzing non-stationary machining sensor signals and shows great potential in detecting abrupt changes of tool conditions. Additionally, spectral signal analysis has been used for condition monitoring and fault detection of wind turbines [31]. The fault detection systems evaluate measured process data (e.g., using spectral analysis for vibration monitoring) to isolate incipient faults at a very early stage, i.e., before they manifest optically or acoustically. This way, a plant can often be repaired before the faulty component actually fails and possibly damages other parts of the plant.

3. Batch-wise unfolding based on bispectral feature extraction

We propose to extract bispectral features from every time-series measurement of each sample. Each bispectral feature of all variables is then placed side by side in a matrix in the same order as the variables of the raw data. As shown in Fig. 3, we form a new matrix of size I × (J × H), where H is the number of bispectral features; H = 2 in this paper. Using bispectral features extracted from the time series data, the dynamic, nonlinear, and non-Gaussian information is kept for each measurement in the 2D unfolded matrix. Meanwhile, such feature extraction-based unfolding regularizes the dataset without trimming any raw data and hence keeps the complete data information for its description. Note: He and Wang [21] proposed an unfolded statistics matrix of size I × (J × (J + 7)/2), which is much larger than our proposed bispectral feature matrix of size I × (J × 2).

3.1. Bispectrum

The bispectrum is the third-order spectrum. It is calculated by:

B(f1, f2) = E[F(f1) F(f2) F*(f1 + f2)]    (1)

Here the bispectrum is evaluated at two independent frequencies f1 and f2. F(f) denotes the discrete-time Fourier transform of the signal, E[·] denotes the expectation operation, and the symbol * denotes the complex conjugate. The frequency f is normalized by the Nyquist frequency to the 0–1 range. The bispectrum quantifies the skewness of a signal. The amplitude of the bispectrum at the bifrequency (f1, f2) measures the amount of coupling between the spectral components at the frequencies f1, f2, and f1 + f2 (see Eq. (1)). A significant frequency component coupling indicates that a quadratic nonlinearity exists in the signal. Furthermore, it indicates the sensitivity of the bispectrum to nonsymmetric nonlinearity.
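As a concrete illustration of Eq. (1), the following Python sketch estimates the bispectrum of a univariate signal by averaging the triple product over data segments. The paper itself used the MATLAB Higher-Order Spectral Analysis Toolbox [32]; the function name, segmentation scheme, and nfft value below are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bispectrum(x, nfft=64):
    """Direct (FFT-based) bispectrum estimate: average F(f1)F(f2)F*(f1+f2)
    over non-overlapping segments of x (a simplified sketch of Eq. (1))."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // nfft
    B = np.zeros((nfft, nfft), dtype=complex)
    for s in range(n_seg):
        seg = x[s * nfft:(s + 1) * nfft]
        seg = seg - seg.mean()                 # remove the segment mean
        F = np.fft.fft(seg, nfft)
        # Outer product gives F(f1)F(f2); the conjugate term is indexed
        # at f1 + f2 (modulo nfft, i.e., frequencies wrap around).
        idx = (np.arange(nfft)[:, None] + np.arange(nfft)[None, :]) % nfft
        B += np.outer(F, F) * np.conj(F[idx])
    return B / max(n_seg, 1)                   # expectation ~ segment average
```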


Fig. 3. Batch-wise unfolding based on bispectral features. (a) A 3D original batch dataset with unequal sample duration; (b) unfolding result by bispectral features.

Additionally, bispectral entropies can be derived from the bispectrum to characterize the rhythmic nature of time-series data, such as the variability and irregularity of the signals. The equations used to determine these spectral features are given below. The mean magnitude of the bispectrum is calculated by:

Mean_mag = (1/L) Σ_Ω |B(f1, f2)|    (2)

The normalized bispectral entropy (BE) is:

Ent = −Σ_n pn log pn    (3)

where L is the number of points within the bifrequency region Ω, pn = |B(f1, f2)| / Σ_Ω |B(f1, f2)|, and (f1, f2) ranges over the paired coupling components in Ω. The entropy is normalized to the 0–1 range (as required of a probability). The entropy measures the uniformity of a power spectral distribution. A signal with a single frequency component has the smallest entropy, while a signal whose frequency components all have equal power (white noise) has the greatest entropy. In normal variations, the power spectrum of a measured signal is dominated by a few peaks with low values elsewhere. The entropy of the power spectrum depends mainly on the degree of dominance of a few peaks, the number of peaks, and their peakedness. A higher entropy value indicates a greater degree of signal irregularity.
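A minimal sketch of Eqs. (2) and (3), assuming a precomputed bispectrum matrix B (e.g., from the sketch in Section 3.1); normalizing the entropy by log(L) to map it to the 0–1 range is our assumption, as the paper does not state the normalization constant.

```python
import numpy as np

def bispectral_features(B):
    """Return (mean magnitude, normalized bispectral entropy) of B."""
    mag = np.abs(B)
    L = mag.size                               # points in the bifrequency region
    mean_mag = mag.sum() / L                   # Eq. (2)
    p = mag / mag.sum()                        # pn over the region Omega
    ent = -np.sum(p * np.log(p + 1e-12))       # Eq. (3), guarded against log(0)
    return mean_mag, ent / np.log(L)           # entropy scaled to the 0-1 range
```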

4. Fault detection based on one-class classification methods

Many one-class classification techniques have been developed for imbalanced two-class data. We mainly evaluate the following popular one-class classifiers: PCA-T2, PCA-SPE, k-NN, and SVDD.

4.1. PCA

PCA-based fault detection methods assume that the data samples are obtained from a process operating in a unimodal region with linearly correlated variables, following a multidimensional Gaussian distribution with unknown mean and covariance. The PCA transformation converts the variances of a set of data into a set of linearly uncorrelated PCs. These PCs make up an orthogonal basis set of eigenvectors. For a given data matrix X with m rows of data points and d columns of features, the covariance matrix of X is calculated by:

cov(X) = X^T X / (m − 1)    (4)

Prior to the calculation in Eq. (4), the columns of X are mean-centered by subtracting the mean of each column and autoscaled to unit variance by dividing by the standard deviation of each column, so that the PCA result is not skewed by scale differences between features. Next, a matrix V, with d columns holding the d-dimensional eigenvectors, can be extracted from the covariance matrix of the autoscaled and centered data by singular value decomposition. Each eigenvector corresponds to an eigenvalue. Ranking the eigenvalues allows the selection of the most significant eigenvalues and their corresponding orthogonal directions (denoted by eigenvectors) to retain a high percentage of the data variance while removing noise and disturbances in the raw data. Assume only k eigenvectors of the matrix V are retained in a d × k matrix P. For X and P coming from a known manufacturing process, any new process data Xnew can be projected onto the subspace P to test its fit to the PCA model. Hotelling's T2 and the squared prediction error (SPE) are usually used for this model-fit test. Hotelling's T2 measures the variation within the PCA model by:

T2 = Xnew^T P Λ^{-1} P^T Xnew    (5)

Here Λ^{-1} is a diagonal matrix containing the inverse eigenvalues corresponding to the k eigenvectors in P. SPE measures the lack of model fit by:

SPE = Xnew^T (I − P P^T) Xnew    (6)

The T2 statistic is a multivariate extension of the t-statistic. When we use a cluster of data points in d-dimensional space to represent a normal process, T2 is the squared distance from a new observation to the center of the data cluster. The boundary of the cluster, such as a hyperellipse, provides an upper control limit (UCL) at a given confidence level for T2. The UCL can be calculated by:

UCL_{T2,α} = [d(m − 1) / (m − d)] F_{1−α, d, m−d}    (7)

Here F_{1−α, d, m−d} is the F statistic with a confidence level of (1 − α). For the SPE, an approximation based on the weighted chi-square distribution can be used to obtain the upper SPE control limit at significance level α [18]:

UCL_{SPE,α} = (v / 2b) χ²_{2b²/v, α}    (8)

Here b and v are the sample mean and variance of the SPE values, and χ²_{2b²/v, α} is the chi-squared distribution with 2b²/v degrees of freedom at confidence level (1 − α).


4.2. k-NN

The k-NN classifier labels a sample by calculating its similarity with samples in the training data. Given training target data X, the k nearest neighbors of each sample can be found. Next, the squared distance for each sample can be calculated, as shown in [9], by:

Di² = Σ_{j=1}^{k} dij²    (9)

Here dij² is the squared Euclidean distance from sample i to its jth nearest neighbor. Given a confidence level (1 − α), the k-NN method labels a new sample Xnew by:

h(Xnew) = normal, if D²_Xnew ≤ D²_α; outlier, otherwise    (10)

The threshold D²_α can be estimated using the Matlab function chilimit in the PLS Toolbox. Another practical solution is to use the Di² value that has a (1 − α) fraction of the calibration samples below it. In this paper, we use the latter.
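A sketch of the one-class k-NN rule in Eqs. (9) and (10), with the threshold set empirically so that a (1 − α) fraction of calibration samples fall below it (the second option described above); the helper names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_oc_knn(X_train, k=3, alpha=0.01):
    """Calibrate the D^2 threshold on normal training data (Eq. (9))."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    d, _ = nn.kneighbors(X_train)               # first neighbor is the point itself
    D2 = (d[:, 1:] ** 2).sum(axis=1)            # Eq. (9) on calibration samples
    threshold = np.quantile(D2, 1 - alpha)      # empirical (1 - alpha) quantile
    return nn, threshold

def predict_oc_knn(nn, threshold, X_new, k=3):
    """Label new samples by the rule in Eq. (10)."""
    d, _ = nn.kneighbors(X_new, n_neighbors=k)  # distances to k nearest training samples
    D2 = (d ** 2).sum(axis=1)
    return np.where(D2 <= threshold, "normal", "outlier")
```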

4.3. SVDD

SVDD fits a hypersphere, described by a center a and radius R, around the target class. Given training data of the target class X, the SVDD method optimizes the two parameters of the sphere and the slack variables ξi ≥ 0 by minimizing the structural error:

ε(R, a, ξ) = R² + C Σ_{i=1}^{d} ξi    (11)

subject to the constraint that bounds the target samples within the sphere at a confidence level (1 − α):

||xi − a||² ≤ R² + ξi    (12)

Here C is a constant that offers a tradeoff between the volume of the sphere and the errors. This optimization problem can be solved by incorporating Eq. (12) into Eq. (11) through Lagrange multipliers and quadratic programming. For a confidence level (1 − α), the SVDD method labels a new sample Xnew by:

h(Xnew) = normal, if ||Xnew − a||² ≤ R²; outlier, otherwise    (13)

The Lagrangian optimization can be solved by formulating the problem completely in terms of inner products of objects (xi, xj). However, researchers [13] generally map the normal data xi onto a suitable feature space for the optimization, as the target data are normally not spherically distributed as described in Eq. (12). An ideal kernel function maps the normal data onto a bounded, spherically shaped area in the feature space while keeping outlier objects outside this area. Mathematically, all inner products (xi, xj) in the calculations of Eqs. (11) and (12) are replaced by a proper kernel function K(xi, xj) to find the desired description boundaries. Various kernel functions have been proposed for SVDD, such as the polynomial kernel and the Gaussian kernel. It has been shown that the Gaussian kernel gives a tight description of the training data. The width of the kernel, as a free parameter, tunes the Gaussian kernel.
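SVDD itself is not part of scikit-learn; as a hedged stand-in, the sketch below uses OneClassSVM with a Gaussian (RBF) kernel, whose one-class ν-SVM formulation is known to be equivalent to SVDD for this kernel choice. The data shapes and parameter values are placeholders, not those tuned in Section 5.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train_normal = rng.normal(size=(100, 38))   # stand-in for the normal 'HOS + PCA' matrix
X_test = rng.normal(size=(20, 38))            # stand-in for test data

# nu plays the role of the volume/error tradeoff (C above); gamma sets the
# Gaussian kernel width. Both values here are illustrative, not the tuned ones.
svdd = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X_train_normal)
labels = svdd.predict(X_test)                 # +1: inside the description (normal); -1: outlier
```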

4.4. Workflow of the proposed fault detection method

As shown in Fig. 4, the proposed fault detection method consists of training a one-class classifier (left part of Fig. 4) and testing for fault detection (right part of Fig. 4). For each data record, two HOS features are extracted by Eqs. (2) and (3). Then eigenvectors are extracted from the normal HOS training data by PCA. The training and testing data are projected into the normal eigenvector space. The normal 'HOS + PCA' data are used to train the one-class classifiers; the testing 'HOS + PCA' data are used for fault detection.

Fig. 4. Flow chart of the proposed fault detection method using 'HOS + PCA' and one-class classifiers.

5. Case study

In this section, we evaluate the performance of the proposed fault detection methods using a publicly available manufacturing process dataset. For the imbalanced dataset, a number of classification evaluation criteria derived from the confusion matrix are used.

5.1. Data set

We evaluate the classification techniques on real manufacturing data from a semiconductor metal etch process [2,3]. The data were acquired from an Al stack etch process performed on a Lam 9600 plasma etch tool, in which a TiN/Al–0.5% Cu/TiN/oxide stack was etched with an inductively coupled BCl3/Cl2 plasma. The key parameters of interest are the linewidth of the etched Al line, uniformity across the wafer, and the oxide loss. The machine state sensors built into the metal etcher collect machine data during wafer processing. The machine data consist of 40 process setpoints, and 19 measured and controlled variables for gas flow rates, chamber pressure, and RF power. These process variables are listed in Table 1. The machine data were sampled at 1 s intervals during the normal etching process. Additionally, a series of three fault experiments was performed by tuning the power, pressure, and gas flow parameters. The machine data include 129 wafers: 108 normal wafers and 21 wafers with intentionally induced faults. The 56th wafer is excluded from the normal wafers because it was misprocessed (an outlier) and has very few runs compared with the other normal wafers. Due to the large amount of missing data in faulty wafer 12, only 20 faulty batches are tested in the present work. Additionally, the metal etch dataset has unequal batch durations. Like many other batch processes, different batches in the etch dataset have different durations: among the 107 normal batches and 20 fault batches, the batch durations range from 95 s to 112 s and from 92 s to 106 s, respectively.


Fig. 5. Illustration of the metal etch process. (a) End-point AL traces (variable 5 in Table 1) from five different batches; (b) Estimate of the bispectrum of a trace in (a).

Fig. 5(a) shows five batch trajectories of the process variable 'End-point'. It demonstrates the unequal sample durations and misaligned sample traces in the semiconductor etch process. Fig. 5(b) shows a contour plot of the bispectrum calculated by Eq. (1). The data used in this example exhibit quadratic phase-coupling that occurs due to nonlinear interactions between harmonic components. The presence of pronounced peaks in the bispectrum of Fig. 5(b) is indicative of nonlinear phenomena [32]. Fig. 5 also indicates a non-Gaussian process according to Hinich's test for Gaussianity and linearity [33]: if the bispectrum is not zero, then the process is non-Gaussian. The non-Gaussianity of the data in Fig. 5 is confirmed by its high average values of kurtosis (−4.1 × 10^10) and skewness (4.7 × 10^6).

5.2. Evaluation criteria

The performance is mostly measured by the fault detection rate (1 − type II error) and the false alarm rate (type I error). As faulty data are much less numerous than normal data in manufacturing processes, the false alarm rate is necessarily an important measurement for evaluating the performance of a classification method, though the fault detection rate is expected to be as high as possible. To deal with such imbalanced datasets, the performance of one-class classifiers is normally evaluated by G-means and the area under the ROC curve (AUC). First, we use the criteria from the confusion matrix. Four possible classification results are included: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The TP rate (TPR), also called sensitivity or recall, and the FP rate (FPR) are defined as follows:

TPR = #TP / (#TP + #FN)    (14)

FPR = 1 − #TN / (#TN + #FP)    (15)

Using TPR and FPR, the ROC curve can be created to evaluate the performance of classification algorithms. An ROC curve shows the relationship between the benefit of TPR and the cost of FPR as the decision threshold varies; an increase in TPR is always accompanied by an increase in FPR. Moreover, we can calculate the AUC by trapezoidal approximation to quantify the overall classification performance. An AUC equal to 0.5 indicates a random classification result, while an AUC equal to 1.0 indicates perfect classification accuracy; a higher AUC value indicates better classifier performance. In this paper, positive and negative data refer to faulty and normal data, respectively. Generally, there are far fewer faulty data than normal data in fault detection of a manufacturing process.

To measure the performance on both classes in a fair way, we calculate the sensitivity, specificity, and their geometric mean, as follows:

G-means = √(Sensitivity × Specificity)    (16)

Here Sensitivity = Recall, and Specificity = 1 − #FP / #Negatives.
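A small sketch of Eqs. (14)–(16), with AUC computed from continuous anomaly scores; the labels and score variables are illustrative stand-ins (1 = fault/positive, 0 = normal/negative).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def g_means_and_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tpr = tp / (tp + fn)                        # Eq. (14): recall / sensitivity
    fpr = 1 - tn / (tn + fp)                    # Eq. (15)
    g = np.sqrt(tpr * (1 - fpr))                # Eq. (16): sqrt(sensitivity x specificity)
    return tpr, fpr, g

# AUC from continuous anomaly scores (e.g., the k-NN distance D^2), computed
# by trapezoidal approximation inside scikit-learn:
# auc = roc_auc_score(y_true, anomaly_scores)
```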

5.3. Experimental results

5.3.1. Data preprocessing

The raw data are obtained by batch-wise unfolding (see Fig. 2(d)). Before unfolding, each batch time series is trimmed by removing the first five sample points and cutting off the last several sample points, keeping each time series at a length of 87 sample points. Using Eqs. (1)–(3), two bispectral features are extracted from each variable of the raw data and assembled into a two-dimensional matrix (refer to Fig. 3). The entropy and magnitude features are extracted for each variable of every sample. For each feature, the transformed normal and fault data are reconstructed into 107 × 19 and 20 × 19 matrices, respectively. Then, PCA is applied to each of these matrices to obtain 107 × 19 and 20 × 19 score matrices for every feature. The PCA-extracted matrices for the two features are stacked along the feature dimension, resulting in a 107 × 38 matrix and a 20 × 38 matrix for the normal and fault data, respectively. For each of the five paired matrices, a box plot is shown in Fig. 6. Overall, most of the bispectral features show higher variation for fault data than for normal data, and the entropy data show higher variation than the magnitude data. Fig. 6(a) and (b) show that the bispectral features of the normal and fault data exhibit some correlation. The PCA projection largely removes this correlation (as shown in Fig. 6(c–e)). This is further verified in Table 2, where we use a two-sample t-test to estimate whether the mean value of each bispectral feature differs significantly between the two classes. Here the null hypothesis is that the two means have no significant difference; a small p-value indicates that the observed difference between the two means is unlikely under this hypothesis. Table 2 shows that, at p-value < 0.05, magnitude has more variables and PC components showing a significant difference between normal and fault data than entropy. When the magnitude and entropy data are combined into one matrix, 36 of the 38 PCs of the matrix have significantly low p-values. In the 'bispectrum + PCA' combinational feature extraction, three groups of PCs are selected to account for 70%, 90%, and 100% of the variability in the bispectral data, respectively. The mean magnitude of the bispectrum and the BE features are extracted from the time series of each variable of each sample and then aligned along the variable dimension for each sample to form a (sample × bispectral feature) matrix of size #samples × (2 × #variables). These bispectral datasets are further used for PCA feature extraction.
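The Table 2 screening could be reproduced with a per-dimension two-sample t-test along the following lines; using Welch's (unequal-variance) variant is our assumption, as the paper does not state which form was applied.

```python
import numpy as np
from scipy import stats

def significant_dimensions(F_normal, F_fault, alpha=0.05):
    """Count feature dimensions whose class means differ at level alpha,
    e.g., F_normal of shape (107, d) and F_fault of shape (20, d)."""
    _, p = stats.ttest_ind(F_normal, F_fault, axis=0, equal_var=False)
    return np.sum(p < alpha), F_normal.shape[1]   # e.g., 36/38 for 'Entropy + Mag PCA'
```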


Fig. 6. Box plot of the data distribution of bispectral features and PCs for the semiconductor metal etch process data. (a) Mean magnitude of the bispectrum (Eq. (2)); (b) Entropy (Eq. (3)); (c) PC projections on all the principal components of the data from (a); (d) PC projections on all the principal components of the data from (b); (e) PC projections on all the principal components for the bispectral data from both (a) and (b).

5.3.2. Classification results

Four one-class classifiers are evaluated on five datasets: the unfolded raw dataset, the bispectral data, 'bispectrum + 70%PCA', 'bispectrum + 90%PCA', and 'bispectrum + 100%PCA'. The classifier parameters are selected for the best performance in the evaluation. The results of PCA-T2 and PCA-SPE are based on a 95% confidence level (refer to [5]). In PCA, the number of PCs selected is the minimum number of PCs that accounts for the specified percentage of the total variance. In k-NN, the choice of k is usually uncritical: larger values of k can reduce the effect of noise on the fault detection, but make the boundaries between normal and fault batches less distinct. We try several different values of k on historical data and choose the one that gives the best cross-validation result; in this paper, k = 3 and the confidence level is 0.99, the same as in [9]. In SVDD, the parameter C in the SVDD model (Eq. (11)) can be tuned to determine the confidence level of the monitoring statistic and control the tradeoff between the volume of the hypersphere and the classification error of the model. We try several values of C and choose the one that gives the best cross-validation result; in this paper, C = 1.031 and the confidence level is 0.99. We apply 10-fold cross validation for training and testing. For each fold, approximately 1/10 of the normal data are randomly selected for testing and the remaining 9/10 of the normal data are used for training.
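A sketch of this 10-fold protocol, reusing the hypothetical k-NN helpers and metric function from the earlier sketches: each fold holds out roughly a tenth of the normal data for testing, trains on the rest, and mixes all faulty data into the test set.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X_normal, X_fault, k=3, alpha=0.01, folds=10, seed=0):
    results = []
    for train_idx, test_idx in KFold(folds, shuffle=True, random_state=seed).split(X_normal):
        nn, thr = fit_oc_knn(X_normal[train_idx], k=k, alpha=alpha)
        X_test = np.vstack([X_normal[test_idx], X_fault])     # mix all faults into every fold
        y_true = np.r_[np.zeros(len(test_idx)), np.ones(len(X_fault))]
        y_pred = (predict_oc_knn(nn, thr, X_test, k=k) == "outlier").astype(int)
        results.append(g_means_and_rates(y_true, y_pred))
    return np.mean(results, axis=0)           # average TPR, FPR, G-means over folds
```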


Table 1
Machine state variables used for monitoring the semiconductor metal etch process.

1. BCl3 flow       2. Cl2 flow          3. RF Btm Pwr        4. RF Btm Rfl Pwr
5. Endpt Al        6. He press          7. Pressure          8. RF tuner
9. RF load         10. RF phase err     11. RF Pwr           12. RF impedance
13. TCP tuner      14. TCP phase err    15. TCP impedance    16. TCP top Pwr
17. TCP Rfl Pwr    18. TCP load         19. Vat valve

"Btm", bottom; "Pwr", power; "Rfl", reflected; "Endpt", end-point.

Table 2
Results of bispectral features for normal and fault classes.

Features:                           Entropy   Mag    Entropy PCA   Mag PCA   Entropy + Mag PCA
# dimensions with p-value < 0.05:   4/19      7/19   10/19         19/19     36/38

Fig. 7. Comparison of different classification methods on the differently processed datasets. (a) TP rate. (b) TN rate. (c) AUC. (d) G-means.

All the faulty data are mixed with the normal test data in each fold for testing. Fig. 7 shows that, using the 'bispectrum + 100%PCA' dataset, k-NN and SVDD can achieve a 100% fault detection rate, while k-NN also maintains a zero false alarm rate. In Fig. 7(c) and (d), these two methods show the best G-means and AUC results among all classifiers. The other classifiers cannot balance a high fault detection rate against a low false alarm rate. Using raw data for pattern classification results in low fault detection rates, G-means, and AUC. Such poor performance is mainly due to the complex nonlinear and non-Gaussian physics of the semiconductor process; the data unfolding also fails to retain the continuity and dynamics features of the original time series data. Directly using bispectral features for pattern classification moderately improves the G-means and AUC results. Bispectral features retain the nonlinear and non-Gaussian characteristics of the time series data. However, the bispectral data are redundant and include many non-discriminative variables, which can degrade the performance of the classifiers (refer to Fig. 6 and Table 2). The linear combination of spectral features by PCA significantly improves the fault detection rate, G-means, and AUC of most of the pattern classifiers. Increasing the number of principal components used in classification mostly improves the classification performance. The best performance occurs when 100% of the principal components of the bispectral features are selected for k-NN. Additionally, the computation times of the classification algorithms are evaluated in MATLAB on a desktop with an Intel(R) Core(TM) i7-7700 CPU @ 4.2 GHz and 32 GB of installed RAM. Fig. 8 shows the mean computation times of the various classifiers on the differently processed datasets. The computation on the bispectral and PCA datasets is faster than on the raw data. For the 'bispectrum + PCA' datasets, SVDD has the slowest computation among the classifiers. The computation speed of the k-NN classifier improved by three orders of magnitude after the bispectral features were used.



Fig. 8. Computation times of the different classifiers on the differently processed datasets.

6. Conclusion

In this paper, we present a bispectrum- and PCA-based feature extraction method for accurate fault detection in nonlinear and non-Gaussian time series data. The experimental evaluation is performed using one-class classifiers for fault detection on nonlinear, non-Gaussian, discontinuous, and complex real semiconductor data. The bispectral features can extract significant nonlinear and non-Gaussian information for fault detection for most of the classifiers. Our evaluation results indicate that the 'bispectrum + PCA' features take very different values for the normal and faulty classes and hence can accurately and robustly discriminate between the two classes. Among the evaluated classifiers, k-NN and SVDD achieve a 100% fault detection rate, while only k-NN maintains a zero false alarm rate; SVDD has a false alarm rate of 60%–80%. PCA-T2 and PCA-SPE perform the worst: they cannot achieve both a high fault detection rate and a low false alarm rate.

7. Discussion

The very good performance of the one-class k-NN classifier on the semiconductor metal etch data indicates that the data can essentially be transformed into "spheres"; this is also indicated by the good performance of SVDD, which likewise works by defining a hypersphere. Such performance may not carry over to every other dataset, but it is possible that the majority of process data have such shapes, which would make the proposed method well suited for fault detection in other manufacturing processes. The proposed spectral features provide an accurate tool for fault detection. However, it is difficult to explain the process physics behind the spectral features and, further, to find the root cause of a fault. Generally speaking, the proposed technique uses a nonlinear transformation to describe the nonlinearities in the data. Nonlinear data processing techniques might negatively affect the ability to perform fault isolation and root cause analysis, due to the difficulties they induce in the identification procedure. A limited amount of literature has been dedicated to the isolation and location of detected faults affecting a variable in the nonlinear case, and it has only partially resolved the problem by using a regularization function [34]. PCA models, however, have a simplicity that eases interpretation. With the inclusion of spectral features, the frequencies and variables contributing to an observed fault might be identified. While this is feasible, it is rarely easy to interpret a PCA model; in the few cases where it is, the domain expert has often already discovered the same interpretation of the data without a PCA model. Another possible solution is to adapt the proposed features for faulty-variable isolation and, further, to pinpoint the root cause of a fault. We can combine additional process knowledge with the isolated faulty variables to help find the root cause. Meanwhile, the extraction of bispectral features from the time-series data is much slower than the classification process. It would be necessary to improve the speed of the bispectral extraction algorithm if the presented 'bispectrum + PCA' method is to be implemented for on-line fault detection in real-time manufacturing.

References
[1] T. Kourti, J.F. MacGregor, Process analysis, monitoring and diagnosis, using multivariate projection methods, Chemom. Intell. Lab. Syst. 28 (1) (1995) 3–21. [2] B.M. Wise, N.B. Gallagher, S.W. Butler, D.D. White, G.G. Barna, A comparison of principal component analysis, multiway principal component analysis, trilinear decomposition and parallel factor analysis for fault detection in a semiconductor etch process, J. Chemom. 13 (3–4) (1999) 379–396. [3] LAM 9600, Metal Etch Data for Fault Detection Evaluation, 1999, available from Accessed 20 January 2019 http://www.eigenvector.com/data/Etch/. [4] T. Wang, M. Qiao, M. Zhang, Y. Yang, H. Snoussi, Data-driven prognostic method based on self-supervised learning approaches for fault detection, J. Intell. Manuf. (2018), http://dx.doi.org/10.1007/s10845-018-1431-x. [5] Q.P. He, J. Wang, Statistical process monitoring as a big data analytics tool for smart manufacturing, J. Process Control 67 (2018) 35–43. [6] F. Lv, C. Wen, M. Liu, Z. Bao, Higher-order correlation–based multivariate statistical process monitoring, J. Chemom. (2018) e3033. [7] T. Inouye, K. Shinosaki, H. Sakamoto, S. Toi, S. Ukai, A. Iyama, et al., Quantification of EEG irregularity by use of the entropy of the power spectrum, Electroencephalogr. Clin. Neurophysiol. 79 (3) (1991) 204–210. [8] U.R. Acharya, S. Dua, X. Du, C.K. Chua, Automated diagnosis of glaucoma using texture and higher order spectra features, Ieee Trans. Inf. Technol. Biomed. 15 (3) (2011) 449–455. [9] Q.P. He, J. Wang, Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes, Ieee Trans. Semicond. Manuf. 20 (4) (2007) 345–354. [10] Q.P. He, J. Wang, Large-scale semiconductor process fault detection using a fast pattern recognition-based method, Ieee Trans. Semicond. Manuf. 23 (2) (2010) 194–200. [11] Z. Zhou, C. Wen, C. Yang, Fault isolation based on k-nearest neighbor rule for industrial processes, Ieee Trans. Ind. Electron. 63 (4) (2016) 2578–2586. [12] S. Mahadevan, S.L. Shah, Fault detection and diagnosis in process data using one-class support vector machines, J. Process Control 19 (10) (2009) 1627–1639. [13] Z. Ge, F. Gao, Z. Song, Batch process monitoring based on support vector data description method, J. Process Control 21 (6) (2011) 949–959. [14] J. Yu, S.J. Qin, Multiway Gaussian mixture model based multiphase batch process monitoring, Ind. Eng. Chem. Res. 48 (18) (2009) 8585–8594. [15] J. Yu, Fault detection using principal components-based Gaussian mixture model for semiconductor manufacturing processes, Ieee Trans. Semicond. Manuf. 24 (3) (2011) 432–444. [16] Y. Zhang, Y. Fan, W. Du, Nonlinear process monitoring using regression and reconstruction method, Ieee Trans. Autom. Sci. Eng. 13 (3) (2016) 1343–1354. [17] M. Onel, C.A. Kieslich, Y.A. Guzman, C.A. Floudas, E.N. Pistikopoulos, Big data approach to batch process monitoring: simultaneous fault detection and diagnosis using nonlinear support vector machine-based feature selection, Comput. Chem. Eng. 115 (2018) 46–63. [18] S. Yin, S.X. Ding, X. Xie, H. Luo, A review on basic data-driven approaches for industrial process monitoring, Ieee Trans. Ind. Electron. 61 (11) (2014) 6418–6428. [19] Z. Yin, J. Hou, Recent advances on SVM based fault diagnosis and process monitoring in complicated industrial processes, Neurocomputing 174 (2016) 643–650. [20] E.N. van Sprang, H.J. Ramaker, J.A. Westerhuis, S.P. Gurden, A.K. Smilde, Critical evaluation of approaches for on-line batch process monitoring, Chem. Eng. Sci. 
57 (18) (2002) 3979–3991. [21] Q.P. He, J. Wang, Statistics pattern analysis: a new process monitoring framework and its application to semiconductor batch processes, Aiche J. 57 (1) (2011) 107–121. [22] W.J. Wang, Z.T. Wu, J. Chen, Fault identification in rotating machinery using the correlation dimension and bispectra, Nonlinear Dyn. 25 (4) (2001) 383–393. [23] G. Dong, J. Chen, F. Zhao, A frequency-shifted bispectrum for rolling element bearing diagnosis, J. Sound Vib. 339 (2015) 396–418. [24] Jing Tian, Carlos Morillo, Michael H. Azarian, Michael Pecht, Motor bearing fault detection using spectral kurtosis-based feature extraction coupled with K-nearest neighbor distance analysis, IEEE Trans. Ind. Electron. 63 (3) (2016) 1793–1803. [25] Y. Li, X. Liang, M.J. Zuo, Diagonal slice spectrum assisted optimal scale morphological filter for rolling element bearing fault diagnosis, Mech. Syst. Signal Process. 85 (2017) 146–161. [26] C.L. Nikias, M.R. Raghuveer, Bispectrum estimation: a digital signal processing framework, Proc. IEEE 75 (7) (1987) 869–891. [27] S.Y. Liang, R.L. Hecker, R.G. Landers, Machining process monitoring and control: the state–of–the–art (2002, January), in: ASME 2002 International Mechanical Engineering Congress and Exposition, American Society of Mechanical Engineers, 2002, pp. 599–610. [28] P. Henriquez, J.B. Alonso, M.A. Ferrer, C.M. Travieso, Review of automatic fault diagnosis systems using audio and vibration signals, IEEE Trans. Syst. Man Cybern. Syst. 44 (5) (2013) 642–652.


[29] K. Zhu, Y. San Wong, G.S. Hong, Wavelet analysis of sensor signals for tool condition monitoring: a review and some new results, Int. J. Mach. Tools Manuf. 49 (7–8) (2009) 537–553. [30] D.E.D. Snr, Sensor signals for tool-wear monitoring in metal cutting operations—a review of methods, Int. J. Mach. Tools Manuf. 40 (8) (2000) 1073–1098. [31] Z. Hameed, Y.S. Hong, Y.M. Cho, S.H. Ahn, C.K. Song, Condition monitoring and fault detection of wind turbines and related algorithms: a review, Renewable Sustainable Energy Rev. 13 (1) (2009) 1–39.

[32] A. Swami, J.M. Mendel, C.L. Nikias, Higher-Order Spectral Analysis Toolbox: for Use With MATLAB, 1998. [33] M.J. Hinich, Testing for Gaussianity and linearity of a stationary time series, J. Time Ser. Anal. 3 (1982) 169–176. [34] C.F. Alcala, S.J. Qin, Reconstruction-based contribution for process monitoring with kernel principal component analysis, Ind. Eng. Chem. Res. 49 (17) (2010) 7849–7857.