Int. J. Mach. Tools Manufact. Vol. 36, No. 12, pp. 1307-1319, 1996
Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain. All rights reserved
0890-6955/96 $15.00 + .00
PII: S0890-6955(96)00054-5
MODELLING PROCESS MEAN AND VARIATION WITH MLP NEURAL NETWORKS

T. W. LIAO†

(Original received 1 May 1995; in final form 15 May 1996)

Abstract--Most industrial processes are intrinsically noisy and non-deterministic. To date, most multilayer perceptron (MLP)-based process models have been established for the process mean only. This paper proposes an approach to modelling the mean and variation of a non-deterministic process simultaneously using an MLP network. The input neurons consist of the process variables and one additional neuron for the z_α value. The corresponding output responses are calculated based on y_i = ȳ_i ± t_{k-1,α}·s_p/√k. The process variance s_p² is determined by pooling the individual sample variances for k experimental conditions. Each sample variance is calculated from the replicated data. The effects of the number of hidden neurons and of the learning algorithm are studied. Two learning algorithms are applied: the back-propagation with momentum (BPM) and Fletcher-Reeves (FR) algorithms. The effectiveness of the proposed approach is tested with a fictitious process and an actual manufacturing process. The test results are provided and discussed. Copyright © 1996 Elsevier Science Ltd
1. INTRODUCTION
Most industrial processes are intrinsically noisy and non-deterministic. One often observes that, for constant input conditions, the process output will vary over some range. Knowing both the mean and the variance of such a non-deterministic process is very important to its design, modelling and control. Traditionally, process design, modelling and control have been achieved by using experimental design procedures, regression analysis, statistical analysis, statistical quality control techniques, etc. Since the development of the back-propagation (BP) algorithm [1], multilayer perceptron (MLP) neural networks have become a popular technique for modelling manufacturing processes, among many other applications. It has been theoretically proven that any continuous mapping from an m-dimensional real space to an n-dimensional real space can be approximated within any given permissible distortion by a three-layered feedforward neural network with enough intermediate units
[2-4]. Multilayer perceptron-based modelling applications have been reported for the turning process [5, 6], grinding process [7], welding process [8], injection moulding process [9], low pressure chemical vapor deposition process [10], plasma etching process [11, 12], composite board manufacturing process [13], etc. Multilayer perceptron-based process modelling is usually achieved with some limited training samples obtained from experiments. Almost without exception, replicated experimental data with different values of a particular output response are reduced to their average value because of the inability of MLP neural networks to learn conflicting data (conflicting data occur when two identical input patterns have entirely different outputs). The result is that only the process mean is captured. Since it is just as important to characterize process variation, a solution must be found to capture process variation information as well. To date, there are still very few studies addressing this issue. Davis [14] employed an artificial neural network (ANN) average model and an ANN noise model to learn a pure Gaussian bump with spatially varying Gaussian noise. The ANN noise model was developed after the ANN average model had been trained. The
†Industrial and Manufacturing Systems Engineering Department, Louisiana State University, Baton Rouge, U.S.A.
"input/output noise pattems" for training the noise model were computed as the absolute errors between the ANN average model estimates and the available experimental samples used to train the average model. The output noises thus calculated are very likely to include the training errors that always exist in obtaining the average model. Nix and Weigend [15] used a split-hidden-unit architecture to learn the average and the variance of the probability distribution of a single target as a function of its inputs, given an assumed Gaussian target error-distribution model. Connection weights for the variance have to be prevented from updating until the average model is obtained. In addition, a moderately sized data set is required because the number of hidden neurons is relatively high. This paper presents an innovative approach to modelling the mean and variation of a non-deterministic manufacturing process simultaneously using a MLP neural network. The details of the proposed approach are described in Section 2. The algorithms used to train MLP neural networks are detailed in Section 3. The proposed approach is first tested with a fictitious process and then tested with a creep feed grinding process. The test results are provided in Section 4, followed by the discussions. The conclusions are given in the last section. 2. PROPOSEDMETHODOLOGY It is understood that, for a non-deterministic process, the same variable setting always yields different values for its responses. The process responses are usually normally distributed and can be easily converted into standard normal distribution by setting z = ( X - X ) / t r , if the process variance o-2 is known. Generally, oa is usually unknown and must be estimated from experimental data. Our methodology determines the process variance based on a pooling of individual sample variances calculated from replicated data corresponding to each experimental condition. Let nj, n2..... nk be the number of observations (replicates) and s2,s 2 ..... s 2 be the sample variances for k different experimental conditions, the pooled estimate of process variance can be calculated as Sp2 =
111$2 + 1)2$2 + ... + ltk$2 /21 +
(1)
152 + . . . + Vk
ni
E cr,,- ; ) S/2 - - t = l
hi--1
(2)
and ni Z --
Yi--
Yti
t=l
ni
(3)
where vl = ( n l - 1), v2 = ( n 2 - 1 ) . . . . . 1)k = (n k - 1) are degrees of freedom for conditions 1, 2 ..... k, and Y,i is an observation for condition i. The proposed methodology uses one integrated architecture to train both process mean and process variance. The input neurons consist of process variables and one additional neuron for the z value. "Input/output-mean patterns" with z=0 are used to train the process mean. On the other hand, "input/output-variance patterns" with z=+z~ and - z ~ are used to learn the process variance. The output responses corresponding to z=+z,, (-z~) are calculated based on y, = y-~ + ( - ) t ~ _ , ~, Sp , ~"
(4)
Generally, z_α = 3 is used to ensure good generalization within ±3σ. The α values, i.e. the probabilities of z > z_α, are 0.0013, 0.0228 and 0.1587 for z_α = 3, 2 and 1, respectively. Before Equation (4) can be used, t_{k-1,α} has to be found. To this end, the t_{k-1,0.0013}, t_{k-1,0.0228} and t_{k-1,0.1587} values are computed according to Zelen and Severo [16] for k - 1 (or ν) = 1, 2, ..., 50, as provided in Table 1. The proposed methodology has the following salient features: (1) it can handle multiple responses; (2) it learns both the process mean and the process variance using one integrated MLP model; (3) the architecture is simple, requiring only one additional input neuron beyond those required for the input variables; (4) it utilizes all the information contained in the replicated data to generate the "input/output-variance" training patterns; (5) it does not require a special learning algorithm: standard learning algorithms such as back-propagation can be applied; and (6) the learned MLP-based model can easily be updated (relearned) to reflect changes in the process mean and/or the process variation.
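As an illustration only (the paper gives no code; the function and variable names below are assumptions), the following sketch pools the replicate variances and assembles the z = 0 and z = ±z_α training patterns for a single response according to Equations (1)-(4), using a scipy t quantile in place of Table 1.

```python
import numpy as np
from scipy import stats

def build_patterns(x, replicates, z_alpha=3.0, alpha=0.0013):
    """Build "input/output-mean" and "input/output-variance" patterns
    for one response, following Equations (1)-(4).

    x          : (k, d) array of process-variable settings, one row per condition
    replicates : list of k 1-D arrays holding the replicated observations y_ti
    z_alpha    : z value attached to the variance patterns (3 by default)
    alpha      : upper-tail probability P(z > z_alpha), 0.0013 for z_alpha = 3
    """
    x = np.asarray(x, dtype=float)
    k = len(replicates)

    means = np.array([np.mean(r) for r in replicates])             # Eq. (3)
    variances = np.array([np.var(r, ddof=1) for r in replicates])  # Eq. (2)
    dof = np.array([len(r) - 1 for r in replicates])               # v_i = n_i - 1
    s_p = np.sqrt(np.sum(dof * variances) / np.sum(dof))           # Eq. (1), pooled

    t_val = stats.t.isf(alpha, df=k - 1)     # t_{k-1,alpha}, cf. Table 1
    offset = t_val * s_p / np.sqrt(k)        # shift term of Eq. (4)

    rows = []
    for z, y in [(0.0, means), (z_alpha, means + offset), (-z_alpha, means - offset)]:
        z_col = np.full((k, 1), z)
        rows.append(np.hstack([x, z_col, y.reshape(-1, 1)]))
    return np.vstack(rows)                   # 3k patterns: [x_1, ..., x_d, z, y]
```

For the grinding data of Section 4.2 (k = 16 conditions with two replicates each), the offset term reduces to t_{15,0.0013}·s_p/4, which is the quantity used there.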
3. TRAINING ALGORITHMS
The multilayer perceptron is one of many commonly used neural networks. Figure 1 shows the structure of an MLP with two hidden layers. The back-propagation algorithm is the most popular training algorithm for MLP networks [17]. It minimizes the total sum of squares (TSS) error using the gradient descent method [1]. The error function E_p is usually defined as

E_p = (1/2) Σ_{i=0}^{n-1} (d_{p,i} - o_{p,i})²,  p = 1, ..., Q,   (5)
where d_{p,i} and o_{p,i} are the desired ith output and the corresponding actual network output for pattern p. E_p approaches 0 as the mapping between inputs and outputs for pattern p is realized. The gradient descent method is applied to update the network parameters by an amount proportional to the partial derivatives of the error function E_p with respect to the given parameters, as shown in Equation (6) for pattern learning or Equation (6') for batch learning.

Table 1. t_{ν,0.0013}, t_{ν,0.0228} and t_{ν,0.1587} computed based on Zelen and Severo [16]
ν    t_{ν,0.0013}   t_{ν,0.0228}   t_{ν,0.1587}      ν    t_{ν,0.0013}   t_{ν,0.0228}   t_{ν,0.1587}
1    244.5850       13.9362        1.8367            26   3.3305         2.0998         1.0194
2    19.5658        4.5211         1.3210            27   3.3175         2.0959         1.0187
3    9.3352         3.3038         1.1966            28   3.3058         2.0923         1.0180
4    6.6862         2.8671         1.1414            29   3.2946         2.0889         1.0174
5    5.5531         2.6468         1.1103            30   3.2846         2.0858         1.0168
6    4.9408         2.5149         1.0904            31   3.2749         2.0829         1.0162
7    4.5607         2.4273         1.0765            32   3.2662         2.0802         1.0157
8    4.3039         2.3650         1.0663            33   3.2578         2.0776         1.0152
9    4.1188         2.3185         1.0585            34   3.2501         2.0752         1.0147
10   3.9798         2.2824         1.0524            35   3.2427         2.0730         1.0143
11   3.8712         2.2536         1.0474            36   3.2360         2.0709         1.0139
12   3.7846         2.2301         1.0433            37   3.2294         2.0688         1.0135
13   3.7134         2.2106         1.0398            38   3.2234         2.0670         1.0132
14   3.6544         2.1941         1.0368            39   3.2175         2.0652         1.0128
15   3.6043         2.1800         1.0343            40   3.2122         2.0635         1.0125
16   3.5616         2.1678         1.0321            41   3.2069         2.0618         1.0122
17   3.5244         2.1571         1.0301            42   3.2021         2.0603         1.0119
18   3.4921         2.1477         1.0284            43   3.1973         2.0588         1.0116
19   3.4634         2.1394         1.0268            44   3.1929         2.0574         1.0113
20   3.4382         2.1319         1.0255            45   3.1886         2.0561         1.0111
21   3.4153         2.1252         1.0242            46   3.1846         2.0548         1.0108
22   3.3950         2.1192         1.0231            47   3.1806         2.0536         1.0106
23   3.3765         2.1137         1.0220            48   3.1770         2.0525         1.0104
24   3.3598         2.1086         1.0211            49   3.1734         2.0513         1.0101
25   3.3444         2.1040         1.0202            50   3.1701         2.0503         1.0099
Fig. 1. General structure of a back-propagation neural network with two hidden layers.
ΔR(T) = -η ∂E_p(T)/∂R(T)   (6)

or

ΔR(T) = -η Σ_{p=1}^{Q} ∂E_p(T)/∂R(T),   (6')

where T is the learning iteration, R is the network parameter and η is the learning rate. The network parameter R can be a connection weight u_{i,j}, v_{i,j} or w_{i,j} between neurons i and j, the threshold t_{k,j} of neuron j in layer k, the maximum output Θ_{k,j} of neuron j in layer k, or the sigmoid slope β_{k,j} of neuron j in layer k, as shown in Fig. 1. It can easily be seen from Equations (6) and (6') that pattern learning updates the weights after every presentation of an individual pattern, whereas batch learning takes in the whole set of training patterns before updating the weights. A momentum factor is often added to increase the learning speed [18]. Accordingly, Equations (6) and (6') are changed to Equations (7) and (7'), respectively:

ΔR(T) = -η ∂E_p(T)/∂R(T) + γ·ΔR(T-1),   (7)

ΔR(T) = -η Σ_{p=1}^{Q} ∂E_p(T)/∂R(T) + γ·ΔR(T-1),   (7')
where γ is the momentum factor with value 0 ≤ γ < 1. Despite this momentum factor, the learning procedure is sometimes still hampered by the occurrence of local minima. Several approaches have been proposed for improving the learning procedure. They can generally be divided into four categories: (1) better choice of the starting points of the training process (determined by the weights of the net) to enable faster convergence; (2) better choice of the neuron activation function;
(3) better specification and adaptation of the MLP topology (number of hidden layers and number of hidden neurons within each hidden layer); (4) use of a better learning algorithm. Usually these improvements are all independent of one another, and they can be used together in an optimal learning scheme. In this study all initial weights are randomly generated and the sigmoid function is consistently used as the neuron activation function. For comparing their performance, the Fletcher-Reeves (FR) algorithm, detailed in Section 3.1, and the BPM method are used to train the MLP process models. The effect of the number of hidden neurons is also investigated. One key issue of a gradient-based MLP learning algorithm is to find the partial derivatives of E_p with respect to the network parameters. Given the mathematical function of each neuron and the connection between every two neurons in adjacent layers, the partial derivatives of E_p can be obtained analytically by the generalized chain rule. According to the network topology shown in Fig. 1, the partial derivatives of E_p with respect to u_{i,j}, v_{i,j}, w_{i,j}, t_{k,j}, Θ_{k,j} and β_{k,j} can be derived directly. Once these partial derivatives are obtained, they can be substituted into Equations (7) and (7') to find the training formulas for pattern learning and batch learning, respectively. Refer to [7] for the derivation of the partial derivatives of E_p with respect to u_{i,j}, v_{i,j}, w_{i,j}, t_{k,j}, Θ_{k,j} and β_{k,j} and for the training formulas for pattern learning and batch learning.
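To make the batch BPM update of Equation (7') concrete, here is a minimal sketch; grad_fn, the parameter packing and the loop structure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bpm_batch_update(params, grad_fn, patterns, eta=0.4, gamma=0.8, iterations=1000):
    """Batch back-propagation with momentum, Equation (7').

    params   : 1-D array holding all network parameters R (weights, thresholds, ...)
    grad_fn  : function(params, pattern) -> dE_p/dR for one training pattern
    patterns : iterable of training patterns
    eta      : learning rate (0.4 is the value used in Section 4)
    gamma    : momentum factor, 0 <= gamma < 1 (0.8 in Section 4)
    """
    delta_prev = np.zeros_like(params)
    for _ in range(iterations):
        # Accumulate the gradient over the whole training set (batch learning).
        total_grad = np.zeros_like(params)
        for p in patterns:
            total_grad += grad_fn(params, p)
        # Equation (7'): Delta R(T) = -eta * sum_p dE_p/dR + gamma * Delta R(T-1)
        delta = -eta * total_grad + gamma * delta_prev
        params = params + delta
        delta_prev = delta
    return params
```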
3.1. Conjugate gradient learning algorithm
Theoretically, any traditional optimization technique, such as the steepest descent method with line search, the conjugate gradient (CG) method and the variable metric (VM) method, can serve as the MLP training algorithm. So can any heuristic technique, such as the simulated annealing method, the tabu search algorithm and the genetic algorithm. Refer to [8] for a comparative study of the standard BP, simulated annealing and tabu search algorithms. We are particularly interested in the Fletcher-Reeves (FR) algorithm, which is a CG method. The CG methods possess the positive characteristics of the steepest descent (Cauchy) and Newton methods. They are reliable and effective, as is the steepest descent method, when the search point is far from local minima. They also accelerate efficiently, as does the Newton method, when the search point approaches a local minimum. The general algorithm of CG methods is given as follows [18].

3.1.1. General gradient-based method

Step 1. Define
M = the maximum number of allowable iterations;
E = the criterion function to be minimized;
N = the number of variables;
R^(0) = [U, V, W, T, Θ, B], the initial estimates of the MLP parameters, where U = [u_{i,j}], V = [v_{i,j}], W = [w_{i,j}], T = [t_{k,j}], Θ = [Θ_{k,j}] and B = [β_{k,j}];
ε = the overall convergence criterion;
α = the initial step size.
Step 2. Set k = 0.
Step 3. Is E(R^(k)) ≤ ε? Yes: go to Step 12. No: continue.
Step 4. Is k ≥ M? Yes: print "termination: k = M" and go to Step 12. No: continue.
Step 5. Calculate G^(k) = ∇E(R^(k)).
Step 6. Calculate s(R^(k)), the search direction in the N-dimensional sub-space of the design variables.
Step 7. Is G^(k)T·s(R^(k)) < 0, i.e. is s(R^(k)) a descent direction? Yes: continue. No: go to Step 12.
Step 8. Find the step size α^(k) along s(R^(k)) using a line search.
Step 9. Set R^(k+1) = R^(k) + α^(k)·s(R^(k)).
Step 10. Is E(R^(k+1)) < E(R^(k))? Yes: go to Step 11. No: print "termination: no descent" and go to Step 12.
Step 11. Set k = k + 1 and go to Step 3.
Step 12. Stop.
Note that a line search method is needed in Step 8 to find the step size along the search direction. To this end, the backtracking line search technique [19], as detailed below, is used.

3.1.2. Backtracking line search method

Step 1. Set λ_0 = 1.0, Δλ = 0.1 and j = 1. Set the lower and upper bounds of λ_j/λ_{j-1}: l = 0.1 and h = 0.5. Set ψ = 10^-4.
Step 2. Is E(R^(k) + s(R^(k))) > E(R^(k)) + ψ·G^(k)T·s(R^(k))? Yes: go to Step 5. No: continue.
Step 3. Set λ_j = λ_{j-1} + Δλ and j = j + 1.
Step 4. Is E(R^(k) + λ_j·s(R^(k))) > E(R^(k)) + ψ·λ_j·G^(k)T·s(R^(k))? Yes: set α^(k) = λ_{j-1} and return. No: go to Step 3.
Step 5. Is this the first backtrack, so that only E(R^(k)) and E(R^(k) + s(R^(k))) are available? Yes: set
λ_1 = -G^(k)T·s(R^(k)) / {2[E(R^(k) + s(R^(k))) - E(R^(k)) - G^(k)T·s(R^(k))]}
and go to Step 7. No: continue.
Step 6. Set
λ_j = [-b + √(b² - 3a·G^(k)T·s(R^(k)))] / (3a)
and continue, where
[a]  =  1/(λ_{j-1} - λ_{j-2}) · [  1/λ_{j-1}²        -1/λ_{j-2}²      ] [ E(R^(k) + λ_{j-1}·s(R^(k))) - E(R^(k)) - λ_{j-1}·G^(k)T·s(R^(k)) ]
[b]                             [ -λ_{j-2}/λ_{j-1}²   λ_{j-1}/λ_{j-2}² ] [ E(R^(k) + λ_{j-2}·s(R^(k))) - E(R^(k)) - λ_{j-2}·G^(k)T·s(R^(k)) ]
Step 7. Does λ_j exist? Yes: continue. No: set α^(k) = λ_{j-1} and return.
Step 8. Is λ_j > h·λ_{j-1}? Yes: set λ_j = h·λ_{j-1} and go to Step 10. No: continue.
Step 9. Is λ_j < l·λ_{j-1}? Yes: set λ_j = l·λ_{j-1}. No: continue.
Step 10. Is λ_j acceptable, i.e. E(R^(k) + λ_j·s(R^(k))) ≤ E(R^(k)) + ψ·λ_j·G^(k)T·s(R^(k))? Yes: set α^(k) = λ_j and return. No: set j = j + 1 and go to Step 6.
In practice, ψ is set very small to guarantee the decrease of the function value. Here ψ = 10^-4 is used. l and h are the lower and upper bounds that prevent λ_j from decreasing or increasing too quickly within each line search iteration. A minimum allowable step length, called the minstep, is imposed. If the condition in Step 5 is satisfied but ‖λ_{j-1}·s(R^(k))‖ is smaller than the minstep, then the line search is terminated. This criterion prevents the line search from looping forever if s(R^(k)) is not a descent direction. (This sometimes occurs at the final iterations of learning algorithms owing to finite precision errors.) A maximum allowable step length, called the maxstep, is also imposed to prevent taking steps that would result in leaving the domain of interest and possibly cause an overflow in the computation. For defining the appropriate search direction in Step 6 of the general gradient-based method, the FR method uses

s(R^(0)) = -G^(0)  and  s(R^(k)) = -G^(k) + [‖G^(k)‖² / ‖G^(k-1)‖²]·s(R^(k-1)).   (8)

Restarting is necessary (by setting s(R^(k)) = -G^(k)) whenever the direction generated by Equation (8) ceases to be a descent direction.
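A compact sketch of how Equation (8), the restart rule and a line search fit together is given below, reusing the backtracking_step helper sketched above; E_fn and grad_fn are assumed user-supplied functions, and the loop is an illustration rather than the paper's code.

```python
import numpy as np

def fletcher_reeves(E_fn, grad_fn, R0, max_iter=1000, eps=1e-4):
    """Fletcher-Reeves conjugate gradient training loop (a sketch).

    E_fn    : function(R) -> scalar error, e.g. the TSS error of Equation (5)
    grad_fn : function(R) -> gradient dE/dR
    R0      : initial parameter vector R^(0)
    """
    R = np.asarray(R0, dtype=float)
    N = R.size
    G = grad_fn(R)
    s = -G                                   # s(R^(0)) = -G^(0), Equation (8)
    for k in range(max_iter):
        if E_fn(R) <= eps:
            break                            # overall convergence criterion
        lam = backtracking_step(E_fn, R, s, G)
        R = R + lam * s
        G_new = grad_fn(R)
        beta = np.dot(G_new, G_new) / np.dot(G, G)   # ||G^(k)||^2 / ||G^(k-1)||^2
        s_new = -G_new + beta * s                    # Equation (8)
        # Restart with the steepest-descent direction periodically (every N
        # iterations) or when s_new is no longer a descent direction.
        if np.dot(G_new, s_new) >= 0.0 or (k + 1) % N == 0:
            s_new = -G_new
        s, G = s_new, G_new
    return R
```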
4. TEST RESULTS
The proposed approach is first tested with a fictitious process and then tested with an actual manufacturing process. For both cases, training was also performed to obtain MLP models for the process mean only, in order to compare their results with the results obtained using the proposed approach. The performances of different neural network architectures and learning algorithms are also studied. In all experiments, batch learning was used. The maximum allowable error was consistently set at 0.0001 to ensure that no training was stopped before completing the specified number of iterations. For the BPM method, a learning rate of 0.4 and a momentum factor of 0.8 are consistently used. Each simulation run is repeated three times to obtain the average performance and the standard deviation. Each simulation run generates a performance value for both the training error and the generalization error. The performance is measured by the mean relative error (MRE), which is defined as

MRE = [Σ_{p=1}^{Q} Σ_{i=0}^{n-1} |d_{p,i} - o_{p,i}| / d_{p,i}] / (nQ).   (9)
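As reconstructed here, Equation (9) normalizes the absolute error by the desired output; in code it amounts to the following illustrative helper (the name and array layout are assumptions).

```python
import numpy as np

def mean_relative_error(desired, actual):
    """Mean relative error, Equation (9).

    desired, actual : arrays of shape (Q, n) holding d_{p,i} and o_{p,i}
    """
    desired = np.asarray(desired, dtype=float)
    actual = np.asarray(actual, dtype=float)
    Q, n = desired.shape
    return np.sum(np.abs(desired - actual) / desired) / (n * Q)
```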
All simulations were executed on an IBM 486/33 MHz personal computer. The test results, as summarized in Tables 4 and 5, are presented in Sections 4.1 and 4.2.
4.1. Test results of a fictitious process
A two-input (x_1 and x_2), two-output (y_1 and y_2) non-deterministic process is used to test the proposed methodology. The process is assumed to have the following functions:

y_1 = x_1 + x_2 + N(0, 2)   (10)

and

y_2 = x_1·x_2 + N(0, 5).   (11)
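Because the noise standard deviations in Equations (10) and (11) are known (2 and 5), the shifted targets for the variance patterns can be produced directly as ȳ ± z·σ. The sketch below (a hypothetical helper, not from the paper) builds the kind of 27-example training set described later in this section.

```python
import numpy as np

def fictitious_patterns(x1_values, x2_values, z_levels=(0.0, 3.0, -3.0),
                        sigma1=2.0, sigma2=5.0):
    """Training patterns (x1, x2, z, y1, y2) for the fictitious process.

    The process means follow Equations (10) and (11):
        y1 = x1 + x2,  y2 = x1 * x2,
    and the known noise standard deviations (2 and 5) are used directly,
    so the targets are shifted by z*sigma for the variance patterns.
    """
    rows = []
    for z in z_levels:
        for x1, x2 in zip(x1_values, x2_values):
            y1 = x1 + x2 + z * sigma1
            y2 = x1 * x2 + z * sigma2
            rows.append((x1, x2, z, y1, y2))
    return np.array(rows)

# The nine (x1, x2) combinations of Table 2 with z in {0, +3, -3} give 27
# patterns; (5, 5, 3) maps to y1 = 16 and y2 = 40, as noted later in the text.
x1 = [5, 10, 15, 5, 10, 15, 5, 10, 15]
x2 = [5, 5, 5, 10, 10, 10, 15, 15, 15]
patterns = fictitious_patterns(x1, x2)
```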
First, both the BPM and FR algorithms were used to train 2x3x2 and 2x5x2 MLP networks for process mean only. The training data consisting of nine examples are provided in Table 2. For the 2x3x2 architecture, the BPM and FR algorithms on average took 443 and 343 sec, respectively, to complete 1000 iterations of learning. The average MRE for the 2x3x2 MLP model trained with BPM and FR are 0.0345 and 0.0245, respectively.

Table 2. Training and testing data for the fictitious process (process mean only)

No.   x1   x2   y1    y2
Training data
1      5    5   10    25
2     10    5   15    50
3     15    5   20    75
4      5   10   15    50
5     10   10   20   100
6     15   10   25   150
7      5   15   20    75
8     10   15   25   150
9     15   15   30   225
Testing data
1      7    7   14    49
2     13    7   20    91
3      7   13   20    91
4     13   13   26   169
5      9    9   18    81
6     11    9   20    99
7      9   11   20    99
8     11   11   22   121
9      6   14   20    84

For the 2x5x2 architecture, the BPM and FR algorithms on average took 589 and
367 sec, respectively, to complete 1000 iterations of learning. The average MRE for the 2x5x2 MLP model trained with BPM and FR are 0.0085 and 0.0075, respectively. Once the MLP models are successfully trained, their generalization capabilities are then tested using another set of data. The testing data are also provided in Table 2. The average generalization MRE for the 2x3x2 MLP model trained with BPM and FR are 0.0475 and 0.0555, respectively. On the other hand, the average generalization MRE for the 2x5x2 MLP model trained with BPM and FR are 0.0185 and 0.018, respectively.

Next, both the BPM and FR algorithms were used to train 3x5x2 and 3x12x2 MLP networks for modelling the process mean and process variance using the proposed methodology. The training data consist of 27 examples, each with three inputs and two outputs. Among them, nine examples are identical to the training data in Table 2 except that they have an additional input x_3 with value 0 (meaning z = 0). Keeping the x_1 and x_2 values unchanged, nine more examples are generated by setting z (or x_3) = +3. For example, one "input/output-variance" training pattern is (x_1, x_2, x_3, y_1, y_2) = (5, 5, 3, 16, 40), in which 16 = 5 + 5 + 3·2 and 40 = 5·5 + 3·5, according to Equations (10) and (11). Similarly, nine more examples are generated by setting z (or x_3) = -3. For the 3x5x2 architecture, the BPM and FR algorithms on average took 1034 and 899 sec, respectively, to learn the MLP model for 1000 iterations. The average MRE for the 3x5x2 MLP model trained with BPM and FR are 0.0455 and 0.035, respectively. The values are 0.026 and 0.023 if the error is calculated based only on the nine examples with x_3 = 0. For the 3x12x2 architecture, the BPM and FR algorithms on average took 1473 and 1571 sec, respectively, to learn the MLP model for 1000 iterations. The average MRE for the 3x12x2 MLP model trained with BPM and FR are 0.041 and 0.015, respectively. The values become 0.023 and 0.013 if the error is calculated based only on the nine examples with x_3 = 0.

Once the MLP models are successfully trained, their generalization capabilities are then tested using another set of data. The testing data also consist of 27 examples. Among them, nine examples are identical to the testing data in Table 2 except that they have an additional input x_3 with value 0 (meaning z = 0). Using the same set of nine combinations of x_1 and x_2 values as in the training data, nine more testing examples are generated by setting z (or x_3) = +1. For example, one "input/output-variance" testing pattern is (x_1, x_2, x_3, y_1, y_2) = (5, 5, 1, 12, 30), in which 12 = 5 + 5 + 1·2 and 30 = 5·5 + 1·5. Similarly, nine more testing examples are generated by setting z (or x_3) = -1. The average generalization MRE for the 3x5x2 MLP model trained with BPM and FR
are 0.032 and 0.034, respectively. The values are 0.036 and 0.033 if the error is calculated based only on the nine examples with x_3 = 0. On the other hand, the average generalization MRE for the 3x12x2 MLP model trained with BPM and FR are 0.031 and 0.021, respectively. The values become 0.027 and 0.017 if the error is calculated based only on the nine examples with x_3 = 0.
4.2. Test results of an actual manufacturing process
The experimental data from a creep feed grinding process [20] were also used to establish MLP-based models for the process mean only, using the conventional approach, and MLP-based models for the process mean and process variance, using the proposed approach. The experiments involved grinding of alumina with diamond wheels. Diamond grinding is the most popular finishing operation for ceramic materials. Unlike conventional grinding, creep feed grinding is performed at low work speed and large depth of cut. Creep feed grinding is usually conducted in one pass and has the potential to produce high quality parts at high productivity. The experimental data, as given in Table 3, were obtained according to a 2^(5-1) design with bond (x_1), mesh size (x_2), concentration (x_3), work speed (x_4) and depth of cut (x_5) as variables varying at two levels. Three process responses were measured: the roughness (y_1), the specific normal force or normal force per unit width (y_2), and the specific grinding power (y_3). For each experimental condition, two observations were taken. The pooled estimates of process variance s_p² for y_1, y_2 and y_3 are calculated to be (0.3965)², (15.0686)² and (0.2974)², respectively, from the experimental data.

Table 3. Training data obtained from the creep feed grinding process (process mean only)

B   m     c     f (in./min) (mm/s)   a (10^-3 in.) (mm)   Ra (μin.) (μm)   F'n (lb/in.) (N/mm)   P'w (Hp/in.) (kW/m)
0   80    50    6.8 (2.9)            58 (1.47)            34.28 (0.87)     253.5 (44.4)          7.385 (217)
0   180   50    6.8 (2.9)            102 (2.59)           32.92 (0.836)    378.5 (66.3)          11.465 (337)
0   80    100   6.8 (2.9)            102 (2.59)           34.32 (0.871)    370 (64.8)            10.7 (315)
0   180   100   6.8 (2.9)            58 (1.47)            26.7 (0.678)     320 (56.0)            7.895 (232)
0   80    50    16.2 (6.9)           102 (2.59)           35.6 (0.904)     559.5 (98.0)          20.125 (592)
0   180   50    16.2 (6.9)           58 (1.47)            33.4 (0.848)     441 (77.2)            12.985 (382)
0   80    100   16.2 (6.9)           58 (1.47)            35.16 (0.892)    404 (70.8)            11.465 (337)
0   180   100   16.2 (6.9)           102 (2.59)           27.74 (0.704)    640 (112.1)           22.155 (651)
1   80    50    6.8 (2.9)            102 (2.59)           34.185 (0.868)   212 (37.1)            5.345 (157)
1   180   50    6.8 (2.9)            58 (1.47)            22.445 (0.57)    179.5 (31.4)          3.67 (108)
1   80    100   6.8 (2.9)            58 (1.47)            31.625 (0.803)   157 (27.5)            3.565 (105)
1   180   100   6.8 (2.9)            102 (2.59)           32.97 (0.837)    159 (27.9)            4.99 (147)
1   80    50    16.2 (6.9)           58 (1.47)            35.63 (0.904)    228.5 (40.0)          6.365 (187)
1   180   50    16.2 (6.9)           102 (2.59)           24.08 (0.611)    434 (76.0)            12.48 (367)
1   80    100   16.2 (6.9)           102 (2.59)           34.68 (0.88)     362.5 (63.5)          9.27 (273)
1   180   100   16.2 (6.9)           58 (1.47)            33.725 (0.856)   184 (32.2)            5.91 (174)
Note: 0 and 1 denote resinoid and vitrified bond, respectively. x_1, x_2, x_3, x_4, x_5, y_1, y_2 and y_3 are B, m, c, f, a, Ra, F'n and P'w, respectively.

First, both the BPM and FR algorithms were applied to train 5x4x3 and 5x10x3 MLP networks for the process mean only, using all the experimental data as the training data. For the 5x4x3 architecture, the BPM and FR algorithms on average took 1169 and 700 sec, respectively, to complete 1000 iterations of learning. The average MRE for the 5x4x3 MLP model trained with BPM and FR are 0.087 and 0.026, respectively. For the 5x10x3 architecture, the BPM and FR algorithms on average took 1836 and 1041 sec, respectively, to complete 1000 iterations of learning. The average MRE for the 5x10x3 MLP model trained with BPM and FR are 0.04 and 0.003, respectively.

Next, both the BPM and FR algorithms were used to train 6x6x3 and 6x12x3 MLP networks for modelling the process mean and process variance using the proposed methodology. The training data consist of 48 examples. Among them, 16 examples are identical to the training data in Table 3 except that they have an additional input x_6 with value 0 (meaning z = 0). Keeping the first five input values unchanged, 16 more examples are generated
by setting z (or x_6) = +3. Since the population variance is unknown, Equation (4) is used. For example, one "input/output-variance" training pattern is (x_1, x_2, x_3, x_4, x_5, x_6, y_1, y_2, y_3) = (0, 80, 50, 6.8, 58, 3, 34.64, 267.1, 7.653), in which 34.64 = 34.28 + 3.6043·(0.3965/√16), 267.1 = 253.5 + 3.6043·(15.0686/√16) and 7.653 = 7.385 + 3.6043·(0.2974/√16). Similarly, 16 more examples are generated by setting z (or x_6) = -3. For the 6x6x3 architecture, the BPM and FR algorithms on average took 2388 and 2452 sec, respectively, to complete 1000 iterations of learning. The average MRE for the 6x6x3 MLP model trained with BPM and FR are 0.036 and 0.023, respectively. The values are 0.021 and 0.020 if the error is calculated based only on the 16 examples with x_6 = 0. For the 6x12x3 architecture, the BPM and FR algorithms on average took 3639 and 3702 sec, respectively, to complete 1000 iterations of learning. The average MRE for the 6x12x3 MLP model trained with BPM and FR are 0.01 and 0.008, respectively. The values become 0.006 and 0.002 if the error is calculated based only on the 16 examples with x_6 = 0.

Once the MLP models are successfully trained, their generalization capabilities are then tested using another set of data. The testing data consist of 32 examples. Keeping the first five input values the same as in the training set, 16 testing examples are generated by setting z (or x_6) = +1. Since the population variance is unknown, Equation (4) is actually used. For example, one "input/output-variance" testing pattern is (x_1, x_2, x_3, x_4, x_5, x_6, y_1, y_2, y_3) = (0, 80, 50, 6.8, 58, 1, 34.38, 257.4, 7.462), in which 34.38 = 34.28 + 1.0343·(0.3965/√16), 257.4 = 253.5 + 1.0343·(15.0686/√16) and 7.462 = 7.385 + 1.0343·(0.2974/√16). Similarly, 16 more testing examples are generated by setting z (or x_6) = -1. The average generalization MRE for the 6x6x3 MLP model trained with BPM and FR are 0.031 and 0.021, respectively. On the other hand, the average generalization MRE for the 6x12x3 MLP model trained with BPM and FR are 0.007 and 0.004, respectively.

5. DISCUSSIONS

The MLP architectures with smaller numbers of hidden neurons were designed according to the relationship established by Mirchandani and Cao [21]. That is,
T_p = Σ_{k=0}^{d} C(H, k),   (12)

where T_p is the minimum number of training patterns required, H is the number of hidden neurons, d is the input space dimension and C(H, k) denotes the binomial coefficient. For instance, the number of hidden neurons for training the process mean and process variation of the fictitious process is five, given d = 3 and T_p = 27. The MLP architectures with larger numbers of hidden neurons were arbitrarily chosen to have approximately twice the number of hidden neurons of the MLP architectures with smaller numbers of hidden neurons. As expected, the results show that an MLP model with a larger number of hidden neurons requires a longer training time because more calculations are involved (see Tables 4 and 5). For both processes tested, the results indicate that MLP models with larger numbers of hidden neurons produce smaller training errors and generalization errors for the BPM method as well as the FR algorithm. The results suggest that a larger number of hidden neurons than that calculated based on Equation (12) should be used. Theoretically, an increase in the number of hidden neurons would reduce the training error after enough training, but might increase the generalization error. This phenomenon, called overfitting, happens because the network learns the variance of the training data (not the process variance discussed in this text). To the best of our knowledge, no reliable theoretical equation can be used to determine the optimum number of hidden neurons for a given application. Murata et al. [22] proposed a network information criterion (NIC) to measure the relative merits of two models that have the same structure but a different number of parameters. The criterion can be applied to determine whether more neurons should be added to a network. The alternative is to conduct a comprehensive simulation study to find the optimum value.
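For illustration, the helper below (names of my own choosing) returns the smallest H for which the sum in Equation (12) reaches T_p; for the grinding data it yields the four (d = 5, T_p = 16) and six (d = 6, T_p = 48) hidden neurons used in the smaller architectures.

```python
from math import comb

def capacity(H, d):
    """Right-hand side of Equation (12): sum_{k=0}^{d} C(H, k)."""
    return sum(comb(H, k) for k in range(d + 1))

def smallest_hidden_layer(T_p, d, H_max=100):
    """Smallest number of hidden neurons H whose capacity covers T_p patterns."""
    for H in range(1, H_max + 1):
        if capacity(H, d) >= T_p:
            return H
    raise ValueError("no H <= H_max satisfies Equation (12)")

# Creep feed grinding: mean-only model (d = 5, 16 patterns) -> 4 hidden neurons;
# mean-and-variation model (d = 6, 48 patterns) -> 6 hidden neurons.
print(smallest_hidden_layer(16, 5), smallest_hidden_layer(48, 6))
```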
Table 4. Test results of MLP models with a smaller number of hidden neurons

                                            BPM (1000 iterations)   FR (1000 iterations)
Fictitious process
 Mean only (2x3x2)
  CPU time (sec)                            443 (25)                343 (40)
  Training MRE                              0.0345 (0.012)          0.0245 (0.002)
  Generalization MRE                        0.0475 (0.011)          0.0555 (0.007)
 Mean and variation (3x5x2)
  CPU time (sec)                            1034 (217)              899 (57)
  Training MRE - overall                    0.0455 (0.025)          0.035 (0.031)
  Training MRE - mean only                  0.026 (0.01)            0.023 (0.01)
  Generalization MRE - overall              0.032 (0.016)           0.034 (0.005)
  Generalization MRE - mean only            0.036 (0.016)           0.033 (0.012)
Creep feed grinding process
 Mean only (5x4x3)
  Average CPU time (sec)                    1169 (67)               700 (17)
  Average training MRE                      0.087 (0.028)           0.026 (0.007)
 Mean and variation (6x6x3)
  Average CPU time (sec)                    2388 (50)               2452 (60)
  Average training MRE - overall            0.036 (0.015)           0.023 (0.012)
  Average training MRE - mean only          0.021 (0.001)           0.020 (0.013)
  Average generalization MRE - overall      0.031 (0.014)           0.021 (0.012)

Note: 443 (25) denotes that the average CPU time is 443 sec and the standard deviation of the CPU time is 25 sec.
Table 5. Test results of MLP models with a larger number of hidden neurons

Learning algorithm                  BPM             BPM             FR              FR
Iterations                          1000            3000            1000            3000
Fictitious process
 Mean only (2x5x2)
  CPU time (sec)                    589 (59)        -               367 (89)        -
  Training MRE                      0.0085 (0.001)  -               0.0075 (0.001)  -
  Generalization MRE                0.0185 (0.006)  -               0.018 (0.005)   -
 Mean and variation (3x12x2)
  CPU time (sec)                    1473 (1)        4421 (31)       1571 (32)       4802 (261)
  Training MRE - overall            0.041 (0.011)   0.026 (0.016)   0.015 (0.002)   0.013 (0.003)
  Training MRE - mean only          0.023 (0.009)   0.015 (0.005)   0.013 (0.003)   0.011 (0.015)
  Generalization MRE - overall      0.031 (0.009)   0.028 (0.009)   0.021 (0.003)   0.019 (0.004)
  Generalization MRE - mean only    0.027 (0.01)    0.031 (0.015)   0.017 (0.003)   0.013 (0.003)
Creep feed grinding process
 Mean only (5x10x3)
  CPU time (sec)                    1836 (203)      -               1041 (12)       -
  Training MRE                      0.04 (0.0027)   -               0.003 (0.001)   -
 Mean and variation (6x12x3)
  CPU time (sec)                    3639 (21)       10943 (127)     3702 (78)       10883 (76)
  Training MRE - overall            0.01 (0.001)    0.007 (0)       0.008 (0.001)   0.006 (0)
  Training MRE - mean only          0.006 (0.002)   0.002 (0.001)   0.002 (0.002)   0.002 (0.001)
  Generalization MRE - overall      0.007 (0.001)   0.003 (0.001)   0.004 (0.001)   0.003 (0.001)

Note: 589 (59) denotes that the average CPU time is 589 sec and the standard deviation of the CPU time is 59 sec.
It is not our intention to find the optimum value in this study; therefore, a comprehensive simulation study was not performed. Our results show that the FR algorithm always produces an MLP model with a lower training error than the BPM method. The FR algorithm takes less time to learn the mean-only MLP process models but, most of the time, requires a longer time to learn the mean-and-variance process models. Also, MLP models learned by the FR algorithm always have a lower generalization error if the models are specified with a larger number of hidden neurons. Neither algorithm has an edge in obtaining a lower generalization error model if a smaller number of hidden neurons is specified. Moreover, the test results show that the proposed methodology is effective in modelling the process mean and process variation simultaneously using one integrated MLP model. For the creep feed grinding process, the (mean only) training and generalization errors of the process mean-and-variation MLP models are consistently smaller than those of the comparable process mean-only MLP models. For the fictitious process, the same conclusion can be made for the MLP model with a smaller number of neurons. The only exception is the MLP model with a larger number of neurons for the fictitious process. Even so, the (mean only) training and generalization errors of this MLP model trained with the BPM (FR) algorithm are low: 0.023 and 0.027 (0.013 and 0.017), respectively. This suggests that the MLP model is good enough to capture both process mean and process variation information.

To investigate whether a better performance can be obtained by increasing the maximum number of learning iterations, the 3x12x2 MLP model for the fictitious process and the 6x12x3 MLP model for the grinding process were also trained by both learning algorithms using 3000 iterations of learning (Table 5). For the 3x12x2 MLP model of the fictitious process, the BPM and FR algorithms on average took 4421 and 4802 sec, respectively, to learn the MLP model for 3000 iterations. The average MRE for the MLP model trained with BPM and FR are 0.026 and 0.013, respectively. The values are 0.015 and 0.011 if the error is calculated based only on the nine examples with x_3 = 0. The average generalization MRE for the MLP model trained with BPM and FR are 0.028 and 0.019, respectively. The values become 0.031 and 0.013 if the error is calculated based only on the nine examples with x_3 = 0. For the 6x12x3 MLP model of the creep feed grinding process, the BPM and FR algorithms on average took 10 943 and 10 883 sec, respectively, to complete 3000 iterations of learning. The average MRE for the MLP model trained with BPM and FR are 0.007 and 0.006, respectively. Both values become 0.002 if the error is calculated based only on the 16 examples with x_6 = 0. The average generalization MRE for the MLP model trained with BPM and FR are equal, with a value of 0.003. These results suggest that both the training and generalization errors can be reduced if the allowable number of iterations is increased from 1000 to 3000. However, the cost is a longer training time.

The significance of the performance difference between two treatments is statistically tested using hypothesis testing. It is assumed that the means of both treatments are normally distributed with unknown variances. According to [23], H_0: μ_1 = μ_2 is rejected if t_0 > t_{α/2,ν} or t_0 < -t_{α/2,ν}. The values of t_0 and ν are calculated as follows:

t_0 = (X̄_1 - X̄_2) / √(s_1²/n_1 + s_2²/n_2)   (13)

and

ν = (s_1²/n_1 + s_2²/n_2)² / [(s_1²/n_1)²/(n_1 + 1) + (s_2²/n_2)²/(n_2 + 1)] - 2.   (14)
For instance, we reject H_0: μ_1 = μ_2 at the α = 0.10 level of significance when testing the hypothesis that the mean training errors of the 6x12x3 MLP model for the creep feed grinding process learned by the BPM and FR algorithms are the same. That is, there is strong evidence indicating that their mean training errors are different.
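The calculation behind Equations (13) and (14) is easy to reproduce from the summary statistics of Table 5. The sketch below (illustrative names, with scipy used only for the t critical value) applies it to the 1000-iteration training MREs of the 6x12x3 grinding model, which is consistent with the rejection reported above.

```python
import numpy as np
from scipy import stats

def welch_t_test(mean1, s1, n1, mean2, s2, n2, alpha=0.10):
    """Two-sample t test with unknown, unequal variances, Equations (13)-(14)."""
    se2_1, se2_2 = s1**2 / n1, s2**2 / n2
    t0 = (mean1 - mean2) / np.sqrt(se2_1 + se2_2)                              # Eq. (13)
    v = (se2_1 + se2_2)**2 / (se2_1**2 / (n1 + 1) + se2_2**2 / (n2 + 1)) - 2   # Eq. (14)
    t_crit = stats.t.isf(alpha / 2, df=v)                                      # t_{alpha/2, v}
    return t0, v, abs(t0) > t_crit

# Training MRE of the 6x12x3 grinding model after 1000 iterations (Table 5):
# BPM: 0.01 (s = 0.001), FR: 0.008 (s = 0.001), three runs each.
t0, v, reject = welch_t_test(0.01, 0.001, 3, 0.008, 0.001, 3, alpha=0.10)
print(t0, v, reject)   # t0 about 2.45, v = 6, reject H0 at the 0.10 level
```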
6. CONCLUSIONS
We have presented a methodology to model the process mean and process variation of a non-deterministic process simultaneously using an MLP neural network. The effectiveness of the proposed methodology is demonstrated using a fictitious process and an actual manufacturing process as examples. Both processes were trained by the BPM and FR algorithms with two different MLP architectures: one with a smaller number of hidden neurons and one with a larger number of hidden neurons. Based on the results obtained in this study, the following conclusions are made: (1) The proposed methodology is effective in modelling the process mean and process variation simultaneously using one integrated MLP model. (2) MLP models with a larger number of hidden neurons produce equivalent or smaller training and generalization errors for the BPM method as well as the FR algorithm. However, overfitting should be avoided to prevent increasing the generalization error. (3) The FR algorithm always produces an MLP model with a lower training error than the BPM method. (4) Both the training and generalization errors can be further reduced, at the cost of training time, if the allowable number of iterations is increased.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representation by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Foundations (edited by D. E. Rumelhart and J. L. McClelland). MIT Press, Cambridge, MA (1986).
[2] K. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2, 183-192 (1989).
[3] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2, 359-366 (1989).
[4] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4, 251-257 (1991).
[5] S. S. Rangwala and D. A. Dornfeld, Learning and optimization of machining operations using computing abilities of neural networks, IEEE Trans. Syst. Man Cybernet. 19(2), 299-314 (1989).
[6] S. Yerramareddy, S. C.-Y. Lu and K. F. Arnold, Developing empirical models from observational data using artificial neural networks, J. Intell. Mfg 4, 33-41 (1993).
[7] T. W. Liao and L. J. Chen, A neural network approach for grinding processes: modelling and optimization, Int. J. Mach. Tools Manufact. 34(7), 917-937 (1994).
[8] T. Huang, C. Zhang, S. Lee and H. P. Wang, Implementation and comparison of three neural network learning algorithms, Kybernetes 22(1), 22-38 (1993).
[9] A. E. Smith, Predicting product quality with backpropagation: a thermoplastic injection moulding case study, Int. J. Adv. Mfg Technol. 8, 252-257 (1993).
[10] F. Nadi, A. M. Agogino and D. A. Hodges, Use of influence diagrams and neural networks in modeling semiconductor manufacturing processes, IEEE Trans. Semiconductor Mfg 4(1), 52-58 (1991).
[11] E. A. Rietman and E. R. Lory, Use of neural networks in modeling semiconductor manufacturing processes: an example for plasma etch modeling, IEEE Trans. Semiconductor Mfg 6(4), 343-347 (1993).
[12] B. Kim and G. S. May, An optimal neural network process model for plasma etching, IEEE Trans. Semiconductor Mfg 7(1), 12-21 (1994).
[13] D. F. Cook and R. E. Shannon, A predictive neural network modelling system for manufacturing process parameters, Int. J. Prod. Res. 30(7), 1537-1550 (1992).
[14] W. Davis, Process variation analysis employing artificial neural network. IJCNN, Baltimore, II-260-265 (1992).
[15] D. A. Nix and A. S.
Weigend, Estimating the mean and variance of the target probability distribution, Proc. 1994 IEEE ICNN, Orlando (1994).
[16] M. Zelen and N. C. Severo, Probability functions. In Handbook of Mathematical Functions, No. 26 (edited by M. Abramowitz and I. A. Stegun), National Bureau of Standards, Applied Mathematics Series 55. U.S. Government Printing Office, Washington, DC (1964).
[17] R. P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Mag. April, 4-22 (1987).
[18] G. V. Reklaitis, A. Ravindran and K. M. Ragsdell, Engineering Optimization: Methods and Applications. John Wiley, New York (1983).
[19] J. E. Dennis Jr and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Non-linear Equations. Prentice Hall, Englewood Cliffs, NJ (1983).
[20] T. W. Liao, Creep feed grinding of alumina with diamond wheels. Ph.D. Dissertation, Lehigh University (1990).
[21] G. Mirchandani and W. Cao, On hidden nodes for neural nets, IEEE Trans. Circuits Systems 36(5), 661-664 (1989).
[22] N. Murata, S. Yoshizawa and S. Amari, Network information criterion - determining the number of hidden units for an artificial neural network model, IEEE Trans. Neural Networks 5(6), 865-872 (1994).
[23] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. John Wiley, New York (1994).