Expert Systems with Applications 38 (2011) 10624–10630
Estimating the shift size in the process mean with support vector regression and neural networks

Chuen-Sheng Cheng *, Pei-Wen Chen, Kuo-Ko Huang
Department of Industrial Engineering and Management, Yuan Ze University, 135 Yuan-Tung Road, Chung-Li 320, Taoyuan, Taiwan, ROC
Keywords: CUSUM; Mean shifts; Neural networks; Support vector regression
Abstract: Control charts are widely used in manufacturing and service industries to determine whether a process is performing as intended or whether unnatural causes of variation are present. Once the control chart detects a process change, the next step is to search for assignable causes and take corrective actions. Before corrective actions are taken, it is critical to search for the cause of the out-of-control situation. During this search, knowledge of the current parameter level helps narrow the set of possible assignable causes. Sometimes, the process/product parameters are adjusted following the out-of-control signal to improve quality. An accurate estimate of the parameter naturally provides a more precise adjustment of the process. A distinct weakness of most existing control chart techniques is that they merely provide out-of-control signals without predicting the magnitudes of changes. In this paper, we develop a support vector regression (SVR) model for predicting process mean shifts. First, a cumulative sum (CUSUM) chart is employed to detect shifts in the process mean. Next, an SVR-based model is used to estimate the magnitude of the shift as soon as CUSUM signals an out-of-control situation. The performance of the proposed SVR was evaluated by estimating mean absolute percentage errors (MAPE) and normalized root mean squared errors (NRMSE) using simulation. To evaluate the prediction ability of SVR, we compared its performance with that of neural networks and statistical methods. Overall results of the performance evaluations indicate that the proposed support vector regression model has better estimation capabilities than CUSUM and neural networks.
1. Introduction

Statistical process control (SPC) concepts and methods have been successfully implemented in manufacturing and service industries for decades. As one of the primary SPC tools, the control chart plays a very important role in attaining process stability. Once the control chart detects a process change, the next step is to search for assignable causes and take corrective actions. Before corrective actions are taken, it is critical to search for the cause of the out-of-control situation. During this search, knowledge of the current parameter level helps narrow the set of possible assignable causes. Sometimes, the process/product parameters are adjusted following the out-of-control signal to improve quality. An accurate estimate of the parameter will certainly provide a more precise adjustment of the process. A distinct disadvantage of most existing control chart techniques is that they merely provide out-of-control signals without predicting the magnitudes of changes, with the exception of the cumulative sum (CUSUM) control scheme (Page, 1961) and the exponentially
weighted moving average chart (Roberts, 1959). When using CUSUM schemes to monitor mean shifts, Montgomery (2005) provides an explicit form to estimate the magnitude of the shift. A modified EWMA procedure useful for estimating the new process mean can also be found in Montgomery (2005). Estimating the process parameter is considered a difficult challenge in SPC: on the one hand we wish to detect process changes as quickly as possible, and on the other we may have insufficient data points to estimate the new process parameter. With advances in technology and manufacturing automation, it is now feasible to apply more sophisticated monitoring procedures. In recent years, several researchers have applied artificial neural networks (ANNs) to process control with significant results. Artificial neural networks can be used to assess process status from control chart samples supplied as inputs. For an in-depth review of the applications of neural networks to process monitoring, the reader is referred to Zorriassatine and Tannock (1998) and Barghash and Santarisi (2007). What follows is a brief review of previous applications of neural networks to statistical process control, as relevant to this study. Chang and Aw (1996) developed a neural fuzzy control chart to detect mean shifts. Their network can also classify magnitudes of
the shifts in order to expedite the diagnosis of assignable causes. Guh and Hsieh (1999) proposed an artificial neural network based model, containing several back propagation networks, to both recognize abnormal control chart patterns and estimate the parameters of the abnormal pattern. Guh and Tannock (1999) introduced a neural network-based approach to recognize control chart patterns and identify key parameters of the specific patterns involved. Their approach is applicable to both single and concurrent patterns. Wang and Chen (2002) developed a neural-fuzzy model not only to identify mean shifts but also to classify their magnitudes. Chen and Wang (2004) also proposed a neural network approach for monitoring and classifying the mean shifts of a multivariate process. Guh (2005) proposed a hybrid model integrating a neural network and a decision tree to detect and classify control chart patterns, and simultaneously estimate the major parameter of the detected pattern. In recent years, the support vector machine (SVM) has been introduced as a new technique for solving a variety of learning, classification and prediction problems. There is also some research on using support vector machines to monitor process variation (Cheng & Cheng, 2008; Chinnam, 2002; Sun & Tsung, 2003). Support vector regression (SVR), a regression version of SVM, has been developed to estimate regression functions. Like SVM, SVR is capable of solving non-linear problems using kernel functions and has been successful in various domains. In this paper, we propose using SVR to predict the magnitude of a process shift. Feed-forward neural networks with different training algorithms and a CUSUM-based estimator are used as benchmarks for comparison. The remainder of the paper is structured as follows. Section 2 explains the research methodologies adopted by this research, including the generation of training/test data, input vectors, and performance metrics. Section 3 presents a brief introduction to SVR and also describes the development of an SVR-based prediction model. Section 4 describes the prediction models constructed using neural networks. This section also describes the experiments performed to select the neural network model with the best performance. In Section 5, results of the comparative study are discussed. The final section concludes the paper and provides suggestions for future research.
2. Research design

The proposed system is schematically illustrated in Fig. 1. It includes a shift detector and a prediction model. The traditional CUSUM chart works as the mean shift detector. A prediction model is then used to estimate the magnitude of the shift as soon as CUSUM signals an out-of-control situation. Note that this research formulated shift characterization as a regression problem instead of a classification problem. This is a natural choice since the shift size Δ ∈ R. Some researchers (see, for example, Chang & Aw, 1996; Chiu, Chen, & Lee, 2001; Chen & Wang, 2004) addressed shift characterization as a multi-class classification problem. This approach would require too many output nodes and is computationally intensive.
Trying to reduce the number of classes would lead to lower precision and lower performance. The CUSUM control scheme is an effective alternative to the traditional Shewhart control chart. The CUSUM procedure attempts to incorporate information from the entire sequence of data. Therefore, the CUSUM scheme can offer considerable performance improvement relative to Shewhart charts. The standardized CUSUM schemes for monitoring the process mean are based on the following recursive statistics:
C_i^+ = max{0, z_i - k + C_{i-1}^+},    C_i^- = max{0, -z_i - k + C_{i-1}^-},    (1)
where z_i can be the standardized statistic of a single observed value or of the average of several observed values taken at sampling time i. The parameter k is usually called the reference value or the slack value. The statistics C^+ and C^- are called the one-sided upper and lower CUSUMs, respectively. The starting values of the CUSUM statistics are C_0^+ = C_0^- = 0. The CUSUM control chart signals if either C_i^+ or C_i^- exceeds the decision interval h. In general, CUSUM control schemes require setting the reference value k and the decision interval h before implementation. A CUSUM with k = 0.5 and h = 5.0 has been widely used in practice, and this CUSUM scheme was adopted in this research. When using CUSUM schemes to monitor mean shifts, Montgomery (2005) recommends the following equations to estimate the magnitude of the shift:
Δ̂ = k + C_i^+ / N^+,       if C_i^+ > h,
Δ̂ = -(k + C_i^- / N^-),    if C_i^- > h,    (2)
where Δ̂ is the estimated magnitude of the shift expressed in standard deviation units. The quantities N^+ and N^- are the numbers of consecutive periods for which the CUSUM statistics C_i^+ and C_i^- have been nonzero. The CUSUM can be thought of as a weighted average of all past and current observations, in which we give equal weight to the last N^+ (or N^-) observations and weight zero to all other observations. Note that the weights are stochastic, or random.

2.1. Selection of input vector

The proposed prediction model is based on the assumption that a number of observations are available for analysis. The number of data points in the sequence provided to the prediction model is referred to here as the window size, m. There are many different elements that could be used to construct an appropriate input vector for this problem. Sample data are natural candidates for the elements of the input vector. In this study, the inputs used by a prediction model are the most recent m observations, whereas the output is the predicted size of the process shift. The prediction function has the form:
Δ̂ = f(V_t) = f(x_{t-m+1}, x_{t-m+2}, ..., x_t),    (3)

where t is the sample index at which CUSUM signals an out-of-control situation, V_t = {x_{t-m+1}, x_{t-m+2}, ..., x_t}, and Δ ∈ {-6, -5.75, ..., -1, +1, ..., +5.75, +6}.
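As a concrete illustration of Eqs. (1)-(3), the following sketch (written in Python as a stand-in for the Matlab implementation used in this research; the function name and example data are our own) monitors standardized observations with the two-sided CUSUM (k = 0.5, h = 5.0), applies the CUSUM estimator of Eq. (2) when a signal occurs, and collects the most recent m = 16 observations as the input vector V_t.

import numpy as np

def cusum_monitor(z, k=0.5, h=5.0, m=16):
    # Two-sided standardized CUSUM of Eq. (1); on a signal, return the
    # signal index t, the shift estimate of Eq. (2), and the window V_t.
    c_plus = c_minus = 0.0
    n_plus = n_minus = 0  # consecutive periods with nonzero C+ / C-
    for t, zi in enumerate(z):
        c_plus = max(0.0, zi - k + c_plus)
        c_minus = max(0.0, -zi - k + c_minus)
        n_plus = n_plus + 1 if c_plus > 0 else 0
        n_minus = n_minus + 1 if c_minus > 0 else 0
        if c_plus > h:                       # upward shift signalled
            return t, k + c_plus / n_plus, z[max(0, t - m + 1):t + 1]
        if c_minus > h:                      # downward shift signalled
            return t, -(k + c_minus / n_minus), z[max(0, t - m + 1):t + 1]
    return None                              # no out-of-control signal

# Example: 50 in-control N(0, 1) observations followed by a shift of size 2.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
print(cusum_monitor(z))

In the proposed system, a window obtained in this way (after standardization) is what the SVR or neural network model receives as its input features.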
[Fig. 1. General framework of the proposed system: collect process data, monitor the process data with CUSUM, and, if the process is out of control, predict the shift size.]
The next question is how many observations to use to construct an input vector. In many problems there is a need to reduce the number of features and select a subset of relevant features for building robust models. This process is known as feature selection or variable selection. The purposes of feature selection are to reduce data complexity and improve accuracy. In statistics, the most prevalent form of feature selection is stepwise regression. In machine learning, classical feature selection methods typically fall into three major categories: filters, wrappers and embedded methods (Guyon & Elisseeff, 2003). In this study the number of observations (or equivalently the window size) was determined by taking into account the domain knowledge of the application problem. We discuss the choice of the number of observations in the following.

When a control chart signals an out-of-control situation there are RL (run length) observations drawn from the shifted process. The run length is the random variable that represents the number of subgroups required until an out-of-control condition is indicated by the chart. The ARL is the expected value of the run length. It measures how quickly the chart reacts to changes in the process. We used the ARL profile of the CUSUM chart to determine the window size. In this investigation, the smallest process shift that is important to detect quickly is Δ = ±1. According to Montgomery (2005), the ARL of a CUSUM with k = 0.5 and h = 5.0 at this shift is about 10.4; therefore, it is reasonable to set m = 16. Table 1 summarizes the sample percentiles of N^+ (N^-) based on 5000 simulation trials for each Δ. It is observed that the value of N^+ (N^-) tends to increase as the shift size Δ decreases. For the smallest shift size (Δ = ±1), the 90th percentile of N^+ (N^-) was found to be 15, indicating that 90% of the N^+ (N^-) values are less than or equal to 15. This empirical analysis also supports the choice of m = 16.

The chosen window size can also be justified by the properties of the EWMA chart. The performance of the EWMA chart has been shown to be approximately equivalent to that of the CUSUM chart (Montgomery, 2005). The EWMA can be viewed as a weighted average of all past and current observations. The weighting properties are controlled by a parameter known as the smoothing constant.
Table 1
Percentiles of N^+ (N^-).

Δ      10th   25th   50th   75th   90th   95th
1.0     5      6      8     12     15     18
2.0     3      4      4      6      8     10
3.0     2      2      3      4      7      9
4.0     2      2      2      4      6      8
5.0     1      2      2      3      5      7
6.0     1      1      1      3      5      7
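The percentiles in Table 1 can be reproduced approximately by simulation. The sketch below (our own illustration, not the authors' code) records N^+ at the moment the upper CUSUM signals, over 5000 trials for a given shift size, and reports its sample percentiles.

import numpy as np

def n_plus_at_signal(delta, rng, k=0.5, h=5.0):
    # Draw from the shifted process N(delta, 1) until the upper CUSUM of
    # Eq. (1) exceeds h; return N+, the consecutive periods with C+ > 0.
    c_plus, n_plus = 0.0, 0
    while c_plus <= h:
        z = rng.normal(delta, 1.0)
        c_plus = max(0.0, z - k + c_plus)
        n_plus = n_plus + 1 if c_plus > 0 else 0
    return n_plus

rng = np.random.default_rng(1)
samples = [n_plus_at_signal(1.0, rng) for _ in range(5000)]
print(np.percentile(samples, [10, 25, 50, 75, 90, 95]))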
[Fig. 2. Weights of past observations against the age of the observation, for smoothing constants 0.05, 0.1 and 0.2.]
The weights for popular choices of smoothing constant are plotted in Fig. 2. It is seen from Fig. 2 that the weights decrease to small values with the age of the observation. The curves shown in Fig. 2 indicate that the most recent 32 observations are the main contributors in determining the process status. This is evidenced by the fact that each curve becomes essentially flat when the age of the observation is greater than 32. In other words, observations beyond 32 are probably unimportant. In consideration of computing efficiency, the window size was set to 16. This setting is roughly comparable to an EWMA with a smoothing constant of 0.2. With a window size of 16, we have 16 observations (elements) as the input features. As with CUSUM and EWMA, the proposed prediction model can be thought of as a weighted average of current and past data, in which the weights are given to the observations within the analysis window. In the current application, we assume that the in-control process mean (target) μ and standard deviation σ are either known or that estimates are available. The process data will be standardized so that the prediction model can be used both for individual observations and for the averages of rational subgroups.

2.2. Generation of data

Due to the extensive demands for training and test data, Monte Carlo simulation was used to generate data. The simulation program was implemented in Matlab. Without loss of generality, we assume the in-control mean and variance are μ0 = 0 and σ0² = 1, respectively. The out-of-control data corresponding to shifts were generated from N(μ0 + Δσ0, σ0²), where Δ is the size of the shift measured in standard deviation units. Samples generated by Monte Carlo simulation were fed into the CUSUM to create input vectors for the prediction model. When CUSUM signals an out-of-control situation, the most recent 16 observations are taken to form an input vector. In the current research it is assumed that process shifts, regardless of their magnitude, are to be detected as quickly as possible. Accordingly, an equal number of training samples was generated for each magnitude of shift. For each Δ, 6250 samples were generated, with 20% (1250) being the training data and 80% (5000) the test data.

2.3. Performance criteria

In estimating prediction performance, it is important to capture both positive and negative errors and not let them offset each other. Therefore, the mean absolute percentage error (MAPE) and the normalized root mean squared error (NRMSE) were used to evaluate the prediction accuracy of SVR and the other prediction models. MAPE and NRMSE measure the deviation between the actual and predicted values; the smaller their values, the closer the predicted values are to the actual values. We note that a few previous studies (Guh, 2005; Guh & Tannock, 1999) reported prediction performance in terms of the average and standard deviation of the predicted outputs, which are not commonly used metrics in the literature. The performance metrics adopted in this research are calculated as follows:
MAPE = (1/N) Σ_{i=1}^{N} |(y_i - ŷ_i) / y_i| × 100%,    (4)

NRMSE = sqrt( Σ_{i=1}^{N} (y_i - ŷ_i)² / Σ_{i=1}^{N} y_i² ),    (5)

where N is the number of data samples.
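A direct transcription of Eqs. (4) and (5) (helper functions of our own) is:

import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error, Eq. (4).
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

def nrmse(y_true, y_pred):
    # Normalized root mean squared error, Eq. (5).
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2))

# Example with dummy shift sizes and predictions.
print(mape([2.0, -3.0, 4.5], [1.8, -3.3, 4.6]), nrmse([2.0, -3.0, 4.5], [1.8, -3.3, 4.6]))

Because the true shift sizes used in this study lie in {±1, ..., ±6}, the denominator in Eq. (4) is never zero.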
3. A support vector regression-based prediction model

In this section, the building process of the SVR-based prediction model is presented in detail. First, the basic theory of SVR is introduced. The critical issues of model selection and training are then described.
3.1. Mathematical formulation of support vector regression

Support vector machines (SVMs) were proposed by Vapnik (1998). Based on the structural risk minimization (SRM) principle, SVMs seek to minimize an upper bound on the generalization error rather than the empirical error, as in other neural networks. Support vector regression (SVR), a regression version of SVM, was developed to estimate regression functions. Like SVM, SVR is capable of solving non-linear problems and has been successful in various domains. For more details on support vector regression, please refer to Smola and Schölkopf (2004). In the regression formulation, the goal is to estimate an unknown continuous-valued function based on a finite set of noisy training samples. For a brief review of SVR, consider a regression function f(x) estimated from given training data {(x_i, y_i) | i = 1, 2, ..., N}, where the d-dimensional input x ∈ R^d and the output y ∈ R. The SVR regression function is formulated in a high-dimensional feature space and has the following form:
f(x) = w^T φ(x) + b,    (6)
where φ(x) denotes the high-dimensional feature space to which the input x is non-linearly mapped. The objective is to choose a hyperplane w with small norm, while simultaneously minimizing the sum of the distances from the data points to the hyperplane. The coefficients w and b are found by solving the primal optimization problem:
Minimize  C (1/N) Σ_{i=1}^{N} L_ε(y_i, f(x_i)) + (1/2) ||w||²,    (7)
where

L_ε(y, f(x)) = |y - f(x)| - ε,   if |y - f(x)| ≥ ε,
L_ε(y, f(x)) = 0,                otherwise.    (8)
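In code form, the ε-insensitive loss of Eq. (8) is simply (a sketch of ours):

def eps_insensitive_loss(y, f_x, eps):
    # Vapnik's epsilon-insensitive loss, Eq. (8): zero inside the
    # epsilon-tube and growing linearly outside it.
    return max(abs(y - f_x) - eps, 0.0)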
In the regularized risk function given by Eq. (7), the first term, C (1/N) Σ_{i=1}^{N} L_ε(y_i, f(x_i)), is the empirical error (risk). SVR performs the regression estimation by risk minimization, where the risk is measured by a loss function. Several alternatives are available; the most common loss function is the ε-insensitive loss function proposed by Vapnik (1998), as given by Eq. (8). The second term, (1/2)||w||², appearing in the regularized risk function is the regularization term. In the ε-insensitive loss function, ε is called the tube size and is equivalent to the approximation accuracy placed on the training data points. Errors below ε do not count; in other words, we do not care about errors as long as they are less than ε. Points inside the ε-insensitive tube do not penalize the minimization problem. C is referred to as the regularization constant and determines the trade-off between the empirical risk and the regularization term. Increasing the value of C increases the relative importance of the empirical risk with respect to the regularization term. Both C and ε are user-prescribed parameters and are selected empirically. To obtain the estimates of w and b, Eq. (7) is transformed into the following soft margin problem:
Minimize  C Σ_{i=1}^{N} (ξ_i + ξ_i*) + (1/2) ||w||²    (9)

subject to

w^T φ(x_i) + b - y_i ≤ ε + ξ_i,    (10)

y_i - w^T φ(x_i) - b ≤ ε + ξ_i*,    (11)

ξ_i, ξ_i* ≥ 0,   i = 1, 2, ..., N,    (12)
where ξ_i and ξ_i* are positive slack variables used to cope with otherwise infeasible constraints of the optimization problem. By introducing Lagrange multipliers, the above problem can be solved as a dual problem. In addition, SVR can estimate a non-linear function by employing a kernel function K(x_i, x_j) = φ(x_i)·φ(x_j). The regression function estimated by SVR can be written as follows:
f(x) = Σ_{i=1}^{N} (α_i - α_i*) K(x, x_i) + b,    (13)
where α_i and α_i* are Lagrange multipliers, the solutions to the dual problem. They satisfy α_i α_i* = 0, α_i ≥ 0 and α_i* ≥ 0 for i = 1, ..., N, and are obtained by maximizing the dual function, which has the following form:
Maximize  Σ_{i=1}^{N} y_i (α_i - α_i*) - ε Σ_{i=1}^{N} (α_i + α_i*) - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (α_i - α_i*)(α_j - α_j*) K(x_i, x_j),    (14)

subject to

Σ_{i=1}^{N} (α_i - α_i*) = 0,    (15)

0 ≤ α_i, α_i* ≤ C,   i = 1, 2, ..., N.    (16)
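The prediction model in this paper was built with the SVR implementation in STATISTICA (see Section 3.2); as a rough equivalent, an ε-insensitive SVR with an RBF kernel can be sketched with scikit-learn, which solves the dual problem (14)-(16) internally. The toy data and parameter values below are illustrative assumptions of our own, not the paper's settings.

import numpy as np
from sklearn.svm import SVR

# Toy training set: 16-observation windows X and their true shift sizes y.
rng = np.random.default_rng(0)
deltas = rng.uniform(1.0, 6.0, 500) * rng.choice([-1.0, 1.0], 500)
X = rng.normal(0.0, 1.0, (500, 16)) + deltas[:, None]   # shifted windows
y = deltas

# epsilon-insensitive SVR with an RBF kernel; C, epsilon and gamma are the
# user-prescribed parameters discussed above (values here are placeholders).
model = SVR(kernel="rbf", C=8.0, epsilon=0.1, gamma=0.0625)
model.fit(X, y)
print(model.predict(X[:5]), y[:5])

After fitting, the dual coefficients (α_i - α_i*) and the support vectors are available as model.dual_coef_ and model.support_vectors_, so the kernel expansion of Eq. (13) can be evaluated explicitly.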
In SVR the basic idea is to map the data into a high-dimensional feature space via a non-linear mapping φ, and to do linear regression in this space. Thus, linear regression in a high-dimensional space corresponds to non-linear regression in the low-dimensional input space R^d. The expensive calculation of dot products in a high-dimensional space can be avoided by introducing a kernel function satisfying K(x_i, x_j) = φ(x_i)·φ(x_j). The kernel function allows all necessary computations to be performed directly in the input space. There are many choices of kernel function; common examples are the radial basis function (RBF), K(x_i, x_j) = exp{-γ ||x_i - x_j||²}, γ > 0, and the polynomial kernel of degree d, K(x_i, x_j) = (x_i·x_j + 1)^d.

3.2. Model selection and training

Although support vector machines have shown excellent generalization performance in a number of applications, one problem that faces the user of an SVR is how to choose a kernel and the specific parameters for that kernel. Applications of an SVR therefore require a search for the optimum settings for a particular application. The kernel function maps the original data into a higher-dimensional space and makes the input data set linearly separable in the transformed space. The choice of kernel function is highly problem-dependent and is the most important factor in support vector machine applications. In this work, the RBF kernel is used as the kernel function of the SVR because it tends to give better performance. The parameters that must be determined are the kernel parameter γ, the regularization parameter C, and ε. The kernel parameter defines the structure of the high-dimensional feature space and is usually selected through experimentation. The regularization parameter C should be chosen with caution to avoid over-fitting. In this study, the free parameters of the SVR were selected through a 5-fold cross-validation experiment. SVR was implemented in
STATISTICA software (STATISTICA, 2009). The chosen kernel was the RBF with γ = 0.0625. The regularization parameter C and ε were set to 8 and 2^-6, respectively. Fig. 3 shows an example of the grid search result, where the x-axis and the y-axis are log2(C) and log2(ε), respectively, and the z-axis is the 5-fold average performance. The findings of this experiment were that SVR is quite robust against the parameter selections.

4. A neural network-based prediction model

The neural network approach has proven to be a universal approximator for any non-linear continuous function with arbitrary accuracy. The most frequently used neural network is the feed-forward multi-layer perceptron (MLP) trained with the backpropagation algorithm. The typical MLP-type neural network is composed of an input layer, one or more hidden layers, and an output layer of nodes. The determination of the number of nodes in each layer is described as follows. In the proposed methodology, the window size determines the number of input nodes. The number of nodes in the output layer was determined by the output representation scheme: a single output node was used, representing the predicted magnitude of the process shift. There are no general guidelines for determining the optimal number of nodes required in the hidden layer, so the number of hidden layer nodes was chosen through experimentation by varying it from 6 to 15.

There are various algorithms available to train the MLP in a supervised manner. Gradient descent (GD) is a first-order optimization algorithm that attempts to move incrementally to successively lower points in the search space in order to locate a minimum. The basic gradient descent method adjusts the weights in the steepest descent direction, the direction in which the performance function decreases most rapidly. Although the function decreases most quickly along the negative of the gradient, this approach does not always generate the fastest convergence. Conjugate gradient (CG) is a fast training algorithm for MLPs that proceeds by a series of line searches through error space, with succeeding search directions selected to be conjugate (non-interfering). A search along conjugate directions usually produces faster convergence than steepest descent directions.
BFGS (quasi-Newton) is a powerful second-order training algorithm with very fast convergence but high memory requirements. In this study the aforementioned algorithms were examined and compared. The neural networks used in this study were developed with the Matlab 7.8 neural network toolbox. The input data were scaled to the interval [-1, 1] using a simple linear transformation, because the data points include both positive and negative values. In this research, the transfer function selected for the hidden nodes and the output node is the hyperbolic tangent function. The MAPE and NRMSE of the different ANN models are summarized in Table 2. The values shown are the averages of MAPE and NRMSE based on 10 replications with different initial weights. It can be seen that the MLP trained with the BFGS algorithm achieves better prediction performance than the other algorithms in terms of MAPE and NRMSE. Note that the performance of BFGS improves as the number of hidden nodes increases. However, the results also indicate that the number of hidden layer nodes has a diminishing return in prediction accuracy. In consideration of computational complexity and the risk of over-fitting, the number of hidden nodes was set to 12; this setting is used in the subsequent analysis. The detailed results on the test set are shown in Fig. 4. This figure is given as an illustration to provide insight into the performance differences of the algorithms. The number of hidden nodes for each algorithm was set based on the results shown in Table 2.
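The networks in this study were built with the Matlab 7.8 neural network toolbox; purely as an illustration, a comparable model can be sketched with scikit-learn's MLPRegressor, using the L-BFGS solver as a stand-in for the BFGS training algorithm and rescaling the inputs to [-1, 1] as described above. Note that MLPRegressor uses a linear output node rather than the hyperbolic tangent output described in the text, and the toy data below are our own.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

# Toy data: 16-observation windows X and their shift sizes y (cf. Section 2.2).
rng = np.random.default_rng(0)
deltas = rng.uniform(1.0, 6.0, 500) * rng.choice([-1.0, 1.0], 500)
X = rng.normal(0.0, 1.0, (500, 16)) + deltas[:, None]
y = deltas

# 16 inputs -> 12 tanh hidden nodes -> 1 output, trained with L-BFGS.
ann = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),
    MLPRegressor(hidden_layer_sizes=(12,), activation="tanh",
                 solver="lbfgs", max_iter=2000, random_state=0),
)
ann.fit(X, y)
print(ann.predict(X[:5]), y[:5])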
[Fig. 4. Performance comparison of various training algorithms: MAPE (%) against Δ for the GD, CG and BFGS networks.]
[Fig. 3. A surface plot of the grid search (γ = 0.0625), showing the 5-fold average performance over log2(C) and log2(ε).]
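The grid search summarized in Fig. 3 can be imitated with 5-fold cross-validation over a grid of C and ε values with γ held fixed; the sketch below (our illustration with toy data, not the original STATISTICA procedure) uses scikit-learn's GridSearchCV with a MAPE scorer.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

def mape(y_true, y_pred):
    # Mean absolute percentage error, Eq. (4).
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

# Toy data: 16-observation windows and their shift sizes.
rng = np.random.default_rng(0)
deltas = rng.uniform(1.0, 6.0, 300) * rng.choice([-1.0, 1.0], 300)
X = rng.normal(0.0, 1.0, (300, 16)) + deltas[:, None]
y = deltas

# Grid over log2(C) and log2(epsilon), as in Fig. 3, with gamma fixed.
param_grid = {"C": 2.0 ** np.arange(-5.0, 10.0, 2.0),
              "epsilon": 2.0 ** np.arange(-9.0, 1.0, 2.0)}
search = GridSearchCV(SVR(kernel="rbf", gamma=0.0625), param_grid, cv=5,
                      scoring=make_scorer(mape, greater_is_better=False))
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # best (C, epsilon) and its MAPE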
Table 3
Summarized results.

            Direct estimate   CUSUM     BFGS                   SVR
            Test              Test      Training   Test        Training   Test
NRMSE       0.1908            0.3605    0.1875     0.1893      0.1387     0.1422
MAPE (%)    17.83             27.16     19.53      19.98       14.02      14.48
Table 2
MAPE (NRMSE) of the ANN models for different training algorithms and numbers of hidden nodes.

Hidden    Gradient descent                  Conjugate gradient                BFGS
nodes     Training         Test             Training         Test             Training         Test
6         24.79 (0.2223)   25.43 (0.2261)   23.99 (0.2173)   24.88 (0.2217)   20.29 (0.1937)   21.16 (0.1977)
9         24.26 (0.2191)   24.87 (0.2228)   23.46 (0.2146)   24.06 (0.2174)   20.07 (0.1910)   20.73 (0.1946)
12        24.82 (0.2224)   25.35 (0.2257)   24.08 (0.2184)   24.46 (0.2195)   19.53 (0.1875)   19.98 (0.1893)
15        24.52 (0.2207)   25.12 (0.2244)   23.67 (0.2157)   24.18 (0.2185)   19.25 (0.1846)   19.86 (0.1879)

Entries are MAPE (NRMSE).
The performance of the conjugate gradient is slightly better than that of the traditional gradient descent method, but it is outperformed by BFGS. The results indicate that the BFGS algorithm outperforms the other two competing algorithms for all shifts, except when Δ is between 2.25 and 4.75.

5. Performance comparison

This section compares the estimation capabilities of SVR, ANN, and statistical methods. The performance of the various estimating procedures was evaluated in terms of MAPE and NRMSE. Table 3 summarizes the results of ANN, SVR, and the other competing methods. It is pointed out here that the direct estimate method computes the process mean by counting backward from the out-of-control signal issued by CUSUM to the time period at which the process shifts (which is normally unknown in practice). The MAPE and NRMSE of the CUSUM and direct estimate methods were computed based on 5000 simulation trials. From Table 3, we note that there is strong agreement between the training results and the testing results, indicating no over-fitting problems with ANN or SVR. From the table, it can be observed that SVR has smaller values of MAPE and NRMSE than the other competing methods. As expected, the direct estimate method outperforms the CUSUM method. Figs. 5 and 6 respectively present the standard deviation and the average of the predicted outputs for the various prediction models. From Fig. 5, we can see that the standard deviation of the predicted outputs of CUSUM increases as the actual shift size Δ increases.
The standard deviations of CUSUM are greater than those of the other methods from Δ = 3.0 onward. This explains why CUSUM has the largest NRMSE value among the different methods, as shown in Table 3. On the other hand, SVR consistently produces the most stable outputs. Examination of Fig. 6 reveals that the average predicted shift size of the direct estimate method is generally higher than the actual shift size. The average predicted shift sizes of SVR and the BFGS neural network are lower than Δ, but not as severely so as for the CUSUM estimator. The reason for the inferior CUSUM performance is that the CUSUM estimator is based on a simple arithmetic average. Fig. 7 compares the estimation capability of SVR, ANN, and the CUSUM method; the MAPE on the test set is reported for each Δ. From the figure, it can be observed that the CUSUM method yields very stable results, although its MAPE tends to increase slightly as Δ increases. It is clear from Fig. 7 that ANN and SVR are also capable of estimating process mean shifts, and that SVR outperforms ANN. Careful investigation of the MAPE shows that the performance of ANN and SVR in estimating the shift magnitude tends to improve as the shift size increases. However, the MAPE starts to increase when Δ increases beyond 5.0. The performance of SVR is slightly worse than CUSUM for changes of magnitude up to Δ = 1.75, but tends to be progressively better for changes of larger magnitude. Although not detailed in this paper, the same observations were made in the case of NRMSE. Another comparison of interest is the prediction accuracy of ANN and SVR for different directions of mean shift; both approaches work well for shifts in either direction. Finally, to further investigate the appropriateness of the chosen input vector, a simple experiment was conducted. The selected variables were chosen by stepwise regression analysis and by the STATISTICA Data Miner built-in feature selection unit (STATISTICA, 2009).
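The variable screening described here relies on STATISTICA's stepwise regression and built-in feature selection unit; a rough analogue of the ANOVA-type F screening can be obtained with scikit-learn's f_regression, as in the sketch below (our illustration with toy data).

import numpy as np
from sklearn.feature_selection import f_regression

# Toy data: 16-observation windows X and shift sizes y.
rng = np.random.default_rng(0)
deltas = rng.uniform(1.0, 6.0, 500) * rng.choice([-1.0, 1.0], 500)
X = rng.normal(0.0, 1.0, (500, 16)) + deltas[:, None]
y = deltas

F, p = f_regression(X, y)              # F statistic and P value per predictor
ranking = np.argsort(F)[::-1] + 1      # variables 1-16 ranked by importance
print(ranking, p.max())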
[Fig. 5. Standard deviation of the predicted outputs for the various prediction models (direct estimate, CUSUM, BFGS, SVR) against Δ.]

[Fig. 6. Average of the predicted outputs for the various prediction models (direct estimate, CUSUM, BFGS, SVR) against Δ.]

[Fig. 7. Prediction accuracy (MAPE (%)) for the different estimation methods (CUSUM, BFGS, SVR) against Δ.]
Table 4
The performance of SVR based on selected variables.

Algorithm              No. of variables   Selected variables     MAPE    NRMSE
Stepwise regression    12                 (1, 5, 6, 8–16)        16.16   0.1534
                       11                 (1, 5, 8–16)           17.40   0.1569
                       10                 (5, 8–16)              17.12   0.1653
                        9                 (8–16)                 18.34   0.1683
Statistica built-in    12                 (3–5, 7–9, 11–16)      16.28   0.1560
                       11                 (4–5, 7–9, 11–16)      16.62   0.1584
                       10                 (4–5, 7–9, 12–16)      17.16   0.1619
                        9                 (4–5, 7, 9, 12–16)     17.67   0.1687
Based on ANOVA, for continuous dependent variables the program computes the ratio of the between-category variance to the within-category variance, and further computes an F statistic and a P value for each predictor variable as the criterion of predictor importance. The program indicates that the P values of all 16 variables are less than the commonly chosen α level of 0.05. The performance of SVR based on the selected variables is shown in Table 4. Comparing the results shown in Tables 3 and 4, it is clear that reducing the number of variables can deteriorate the performance of SVR.

6. Conclusion

In this paper, we have proposed SVR as an approach to estimating the size of a process shift. The performance of the proposed approach was evaluated by estimating the mean absolute percentage error (MAPE) and the normalized root mean squared error (NRMSE) using simulation. The proposed SVR approach was compared to ANN and traditional methods. Extensive comparisons show that the approach presented in this paper offers a competitive alternative to existing control procedures. Our study reveals that SVR achieves the best performance, indicating that SVR is a promising tool for estimating the magnitude of a mean shift. Future research can be performed in a number of areas. In this study, CUSUM was employed as the shift detector; one area of research is therefore to study the effect of the CUSUM parameters on prediction performance. A second area would be the construction of the input vector based on features extracted from the original input observations.

References

Barghash, M. A., & Santarisi, N. S. (2007). Literature survey on pattern recognition in control charts using artificial neural networks. In Proceedings of the 37th conference on computers and industrial engineering, Alexandria, Egypt (pp. 20–23).
Chang, S. I., & Aw, C. A. (1996). A neural fuzzy control chart for detecting and classifying process mean shifts. International Journal of Production Research, 34(8), 2265–2278.
Chen, L. H., & Wang, T. Y. (2004). Artificial neural networks to classify mean shifts from multivariate chart signals. Computers and Industrial Engineering, 47(2–3), 195–205.
Cheng, C. S., & Cheng, H. P. (2008). Identifying the source of variance shifts in the multivariate process using neural networks and support vector machines. Expert Systems with Applications, 35(1–2), 198–206.
Chinnam, R. B. (2002). Support vector machines for recognizing shifts in correlated and other manufacturing processes. International Journal of Production Research, 40(17), 4449–4466.
Chiu, C. C., Chen, M. K., & Lee, K. M. (2001). Shifts recognition in correlated process data using a neural network. International Journal of Systems Science, 32(2), 137–143.
Guh, R. S., & Hsieh, Y. C. (1999). A neural network based model for abnormal pattern recognition of control charts. Computers and Industrial Engineering, 36(1), 97–108.
Guh, R. S., & Tannock, J. D. T. (1999). A neural network approach to characterize pattern parameters in process control charts. Journal of Intelligent Manufacturing, 10(5), 449–462.
Guh, R. S. (2005). A hybrid learning-based model for on-line detection and analysis of control chart patterns. Computers and Industrial Engineering, 49(1), 35–62.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(1), 1157–1182.
Montgomery, D. C. (2005). Introduction to statistical quality control. New York, USA: Wiley.
Page, E. S. (1961). Cumulative sum chart. Technometrics, 3(1), 1–9.
Roberts, S. W. (1959). Control chart tests based on geometric moving averages. Technometrics, 42(1), 97–102.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
STATISTICA (2009). STATISTICA Data Miner. Tulsa, USA: StatSoft, Inc.
Sun, R., & Tsung, F. (2003). A kernel-based multivariate control chart using support vector methods. International Journal of Production Research, 41(13), 2975–2989.
Vapnik, V. (1998). Statistical learning theory. New York, USA: John Wiley and Sons.
Wang, T. Y., & Chen, L. H. (2002). Mean shifts detection and classification in multivariate process: A neural-fuzzy approach. Journal of Intelligent Manufacturing, 13(3), 211–221.
Zorriassatine, F., & Tannock, J. D. T. (1998). A review of neural networks for statistical process control. Journal of Intelligent Manufacturing, 9(3), 209–224.