Copyright © IFAC Artificial Intelligence in Real-Time Control, Arizona, USA, 1998
ADAPTIVE ESTIMATION USING MULTIPLE MODELS AND NEURAL NETWORKS
Tyrone L. Vincent*, Cecilia Galarza**, Pramod P. Khargonekar**
* Engineering Division, Colorado School of Mines, Golden, CO 80401, [email protected]
** EECS Department, University of Michigan, Ann Arbor, MI 48109
Abstract: A method is presented to combine multiple model estimation with a neural network to obtain more accurate estimates. The key idea is to use the data from the initial phase of the run for system identification, and then run a single estimator designed for the identified model for the remainder of the run. The use of multiple models and neural networks allows the on-line identification to take place extremely quickly. The method is validated on actual data from an important estimation problem in microelectronics manufacturing which is subject to model uncertainties: determining end-point to an etch step using reflectometry data. Copyright © 1998 IFAC
Keywords: Adaptive Algorithms, Neural Networks, Estimation Algorithms
1. INTRODUCTION
Estimation is a key element of advanced sensing and control of complex systems. Often a model is available but may have unknown parameters, requiring adaptive estimation techniques to be applied. A useful approach to adaptive estimation is the so-called multiple model estimation. In this case, the system dynamics are unknown, but assumed to be one of a fixed number of known possibilities. That is, consider the set of models

$$x(k+1) = f(k, x(k), u(k), \theta_j) + w(k) \tag{1a}$$
$$y(k) = g(k, x(k), u(k), \theta_j) + n(k) \tag{1b}$$

where $x \in \mathbb{R}^n$ is the state, $y \in \mathbb{R}^p$ is the output, $w \in \mathbb{R}^n$ is a random disturbance and $n \in \mathbb{R}^p$ is random measurement noise. The model is parameterized by $\theta_j \in S \subset \mathbb{R}^d$, where $S$ is a finite set. It is assumed that the true system is captured by some model $j$. This approach began with the work of Magill (1965), who considered the problem under the usual linear/Gaussian assumptions. Magill derived the optimal least squares estimator, which consists of running several Kalman filters in parallel, the number of filters equal to the number of possible modes. The minimum variance state estimate is given by

$$\hat{x}(k|k) = \sum_j \hat{x}_j(k|k)\, P[\theta_j | y^k] \tag{2}$$

where $\hat{x}_j(k|k)$ is the minimum variance estimate for model $j$, and $P[\theta_j | y^k]$ is the probability that model $j$ is the correct model given $y^k$, the measured data up to time $k$. The linear/Gaussian assumptions allow a recursive algorithm for the calculation of $P[\theta_j | y^k]$.
The multiple model approach has been used in many applications, including flexible structures (Griffin Jr. and Maybeck, 1995), polymer deposition (Krauss and Kamen, 1996) and oxide etching (Vincent et al., 1997b) in electronics manufacturing, and unmanned flight control (Lane and Maybeck, 1994). Often, these applications do not exactly match the multiple model estimation structure. In particular, the model uncertainty may be parametric, but the unknown parameter may vary continuously, rather than discretely. A possible approach to apply the multiple model method is to discretize the model parameter space. The choice of discretization is discussed by Sheldon and Maybeck (1993), who propose a design method to best choose a discrete model set to approximate the true uncertainty in order to minimize the estimation error. However, for computational reasons, it may not be possible to grid the parameter space sufficiently finely to obtain the true model parameter to the desired accuracy. Our approach is to use a neural network in conjunction with the multiple model estimators to interpolate between them. This method is related to that described in (Fisher and Rauch, 1994), where a neural network is used to extend the region of operation of an extended Kalman filter. However, there are many advantages to using a multiple model approach, and it is our objective to reduce the computational expense by enabling a coarser discretization of the uncertain parameter space.
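As a concrete illustration of the multiple model estimator that such a discretization feeds, the following Python sketch runs a bank of scalar Kalman filters, one per candidate parameter value, computes the model probabilities by Bayes' rule, and forms the probability-weighted estimate of (2). Everything specific here is an assumption made for illustration and is not taken from the paper: the toy model x(k+1) = θ x(k) + w(k), y(k) = x(k) + n(k), the three candidate values, the noise variances, and the function names `run_filter_bank` and `posterior`. Noise-free data are used only so that the run is deterministic.

```python
import math

def run_filter_bank(ys, thetas, x0, q, r):
    """Run one scalar Kalman filter per candidate theta.

    Assumed toy model (not from the paper):
        x(k+1) = theta * x(k) + w(k),   y(k) = x(k) + n(k),
    with Var[w] = q and Var[n] = r.
    Returns final state estimates, Gaussian log-likelihoods and the
    sums of normalized squared residuals (one value per model).
    """
    x_hat, log_lik, V = [], [], []
    for theta in thetas:
        x, P = x0, 0.0                      # initial state assumed known
        ll, v = 0.0, 0.0
        for y in ys:
            x_pred = theta * x              # time update
            P_pred = theta * theta * P + q
            S = P_pred + r                  # output (innovation) covariance
            resid = y - x_pred              # measured minus predicted output
            K = P_pred / S                  # Kalman gain
            x = x_pred + K * resid          # measurement update
            P = (1.0 - K) * P_pred
            v += resid * resid / S
            ll += -0.5 * (math.log(2.0 * math.pi * S) + resid * resid / S)
        x_hat.append(x)
        log_lik.append(ll)
        V.append(v)
    return x_hat, log_lik, V

def posterior(log_lik, prior):
    """Bayes' rule over the discrete model set, evaluated in log space."""
    m = max(log_lik)
    w = [p * math.exp(l - m) for l, p in zip(log_lik, prior)]
    total = sum(w)
    return [wi / total for wi in w]

# Candidate models and noise-free data generated by the middle candidate.
thetas = [0.7, 0.8, 0.9]
x, ys = 5.0, []
for _ in range(20):
    x = thetas[1] * x
    ys.append(x)

x_hat, log_lik, V = run_filter_bank(ys, thetas, x0=5.0, q=0.01, r=0.01)
post = posterior(log_lik, prior=[1.0 / 3.0] * 3)

# Probability-weighted mixture of the individual estimates, as in (2).
x_mix = sum(p * xh for p, xh in zip(post, x_hat))
# Index of the model with the largest a-posteriori probability.
map_index = max(range(len(post)), key=post.__getitem__)
```

In this toy run the filter built on the data-generating value has zero residuals, so it receives the largest probability. When the true parameter falls between grid points no filter matches exactly, which is the situation the interpolation idea is meant to address.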
This work is very strongly motivated by the following technological application. In microelectronics manufacturing, a common processing step involves reactive ion etching, where features are etched into previously deposited material. An important problem is estimating when to stop the etch, or in other words, when to call the process end point. One sensor which is used to determine film thickness is reflectometry, whereby collimated light is directed onto the wafer's surface and the reflected light intensity is monitored. As has been shown in (Vincent et al., 1996; Vincent et al., 1997b), the etching/reflectometry system can be modeled with linear dynamics and a nonlinear output equation, which maps the remaining film thickness to reflectance. This mapping is well defined if the underlying materials are known. On the other hand, if there is some uncertainty in the underlying stack then the nonlinear output equation is uncertain.

To deal with this uncertainty, one approach would be to define new state variables corresponding to the uncertain parameters with integrator dynamics. This approach is not successful in this application as there are severe observability problems in the resulting estimation problem. In addition, the etching step can be fairly short, and there is limited time for an adaptive estimator to converge. This suggests the use of the multiple model estimation technique, whereby a set of models is chosen which discretizes the uncertainty space. However, accuracy is critical, and errors in the stack model will contribute to errors in the estimated remaining thickness.

To improve the accuracy of the multiple model approach requires exploiting some of the specific requirements of the end-point estimation problem. The key task is to obtain accurate estimates of the remaining film thickness, but only for the period of time just before the desired end point is reached. If, before the end point is reached, an estimate of the true underlying stack can be obtained, an improved estimate of the remaining thickness could be calculated during the remainder of the etch using a single estimator that uses this newly estimated stack model. Restated, a possible procedure is to process data in preparation for a system identification until accurate estimates are required, and then perform the system identification and use an estimator designed for the identified model after this point. Thus we have explicitly separated the tasks of system identification and estimation which are usually combined in an adaptive estimation algorithm. However, the on-line nature of the estimation problem places greater demands on fast convergence and low computational requirements than an off-line system identification problem does. The method presented here of combining multiple model estimation with a neural network addresses this need.

2. METHOD

In multiple model estimation, the a-posteriori probabilities $P[\theta_j | y^k]$ for a discrete set of models are calculated (under linear/Gaussian assumptions) via

$$P[\theta_j | y^k] = \frac{p(y^k | \theta_j)\, P[\theta_j]}{\sum_i p(y^k | \theta_i)\, P[\theta_i]}$$

where $P[\theta_j]$ is known a-priori and $p(y^k | \theta_j)$ is the probability density function for the measurements $y^k$ given model $\theta_j \in S$ and is calculated on-line as

$$p(y^k | \theta_j) = \prod_{i=1}^{k} (2\pi)^{-p/2}\, |C_i^j|^{-1/2} \exp\!\left(-\tfrac{1}{2}\,(y_i - \hat{y}_i^j)'\,(C_i^j)^{-1}\,(y_i - \hat{y}_i^j)\right)$$
where $\hat{y}_i^j$ is the Kalman filter output estimate for model $j$ at time $i$, and $C_i^j$ is the output covariance matrix, which can also be calculated by the Kalman filter. Note that $p(y^k | \theta_j)$ is simply a measure of the closeness of the observed data to that which would have been produced by model $\theta_j$. The optimal multiple-model estimate given in (2) is a mixture of the estimates of each model weighted by the a-posteriori probabilities $P[\theta_j | y^k]$. The greatest weight is given to the model with the maximum a-posteriori probability. This model is termed the maximum a-posteriori (MAP) model estimate. Often, the weight corresponding to a single model dominates the others. For example, in (Baram and Sandell Jr., 1978) general conditions, including the usual LTI/Gaussian assumptions, are given such that $\lim_{k \to \infty} P[\theta_j | y^k] = 1$ for the $\theta_j$ closest to the true model in an information distance metric, and $\lim_{k \to \infty} P[\theta_i | y^k] = 0$ for all other models. This makes explicit the system identification which is implicitly performed by the multiple model adaptive estimation.

In order to calculate the MAP estimate in the case of continuous parameter variation, one would like to calculate the probability density $p(\theta | y^k)$ for arbitrary values of $\theta$ in order to perform a numerical search for the maximum. This would be given by

$$p(\theta | y^k) = \frac{p(y^k | \theta)\, p(\theta)}{\int p(y^k | \theta)\, p(\theta)\, d\theta}$$

but it cannot be calculated, as $p(y^k | \theta)$ is only available for the discrete values of $\theta$ in $S$. Clearly, a numerical search for the maximum of $p(\theta | y^k)$ would be very computationally expensive for a general, perhaps time varying or nonlinear, model.

As an alternative, it is proposed to use the information contained in $p(y^k | \theta_j)$ as a measure of the distance of the model $\theta_j$ to the correct model. By comparing the relative distance to each model in $S$, an estimate of the true model parameter can be obtained. The idea is depicted in Figure 1 for the case when $\theta \in \mathbb{R}^2$. Suppose that we have calculated that $p(y^k | \theta_j) = c_j$ for $\theta_j \in S$. The rings represent the set in $\mathbb{R}^2$ for which the expected value of $p(y^k | \theta_j)$ is $c_j$ given $\theta_{true}$ is the correct parameter. That is, each ring represents the set $\{\theta \mid E[p(y^k | \theta_j) \mid \theta = \theta_{true}] = c_j\}$ for some $j$, where the expectation is taken over the data $y^k$. If the intersection of these sets could be found, an estimate of the true parameter $\theta_{true}$ could be obtained.

Fig. 1. Identification using distance from multiple models

In order to simplify calculations and to improve scaling, define

$$V_j(y^k) := \sum_{i=1}^{k} (y_i - \hat{y}_i^j)'\,(C_i^j)^{-1}\,(y_i - \hat{y}_i^j)$$

and observe that $-\log p(y^k | \theta_j) \propto V_j(y^k)$ modulo a constant. Thus $V_j(y^k)$ also qualifies as a measure of the goodness of fit, and will be used in what follows.

For clarity, first consider the case with the noise and disturbance equal to zero and the input and initial conditions fixed and known. Then the output $y^k$ is a function of $\theta$ only, and thus so is the measure of fit $V_j(y^k)$. Let

$$\Gamma(\theta) = \begin{bmatrix} V_1(y^k(\theta)) \\ \vdots \\ V_N(y^k(\theta)) \end{bmatrix}$$

denote the mapping from the true parameter $\theta$ to the vector of measures of fit for each estimator. This mapping can be (locally) inverted if it is (locally) one to one and continuous. Clearly, at minimum, we require $N \geq d$, where $d$ is the dimension of $\theta$. Because this mapping is an extremely complex function of $\theta$, it is proposed to take advantage of the well known universal approximation properties of a feed forward neural network to determine this inverse mapping $\Gamma^{-1}(v)$. Thus, we will have

$$\hat{\theta} = NN(V_1(y^k), V_2(y^k), \ldots, V_N(y^k))$$

where $NN(\cdot)$ indicates a neural network. In this way, the neural network interpolates between the fixed values of $\theta_j$ chosen for the multiple model estimator.

The training of the neural network takes place off line, by simulating the system (1) with choices of $\theta$ distributed over the expected range. Note that this distribution can be much finer than that chosen for the multiple model estimators. Since this neural network is trained off line, the on-line computational cost is determined solely by the number of estimators in the multiple model estimator. Because of the interpolation which is afforded by the neural network, fewer estimators may be required, and the speed of the multiple model estimation can be improved.
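The idea of recovering the parameter from the measures of fit, with a network trained off line on simulated runs, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy model x(k+1) = θ x(k), y = x with known initial condition, the three model values, the use of open-loop model outputs in place of Kalman filter predictions (reasonable only in the noise-free case), unit output covariance, and the hypothetical `TinyMLP` class with a single tanh hidden layer are all choices made here for brevity.

```python
import math
import random

THETAS = [0.7, 0.8, 0.9]   # coarse model set (illustrative values)
X0, K = 5.0, 20            # known initial condition and number of samples

def output(theta):
    """Noise-free trajectory y_1..y_K of the assumed toy model."""
    return [X0 * theta ** i for i in range(1, K + 1)]

MODEL_OUT = [output(t) for t in THETAS]

def gamma(theta):
    """Vector of measures of fit V_1..V_N for data generated by theta."""
    y = output(theta)
    return [sum((a - b) ** 2 for a, b in zip(y, yj)) for yj in MODEL_OUT]

class TinyMLP:
    """One tanh hidden layer, linear output, batch gradient descent."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.gauss(0.0, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [rng.gauss(0.0, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        return h, sum(w * hi for w, hi in zip(self.W2, h)) + self.b2

    def loss(self, X, T):
        return 0.5 * sum((self.forward(x)[1] - t) ** 2
                         for x, t in zip(X, T)) / len(X)

    def train(self, X, T, epochs=1500, lr=0.02):
        n_hidden, n_in = len(self.W1), len(self.W1[0])
        for _ in range(epochs):
            gW1 = [[0.0] * n_in for _ in range(n_hidden)]
            gb1, gW2, gb2 = [0.0] * n_hidden, [0.0] * n_hidden, 0.0
            for x, t in zip(X, T):
                h, out = self.forward(x)
                e = (out - t) / len(X)        # d(loss)/d(out) for this sample
                gb2 += e
                for j in range(n_hidden):
                    gW2[j] += e * h[j]
                    d = e * self.W2[j] * (1.0 - h[j] ** 2)  # back through tanh
                    gb1[j] += d
                    for i in range(n_in):
                        gW1[j][i] += d * x[i]
            self.b2 -= lr * gb2               # apply the accumulated gradients
            for j in range(n_hidden):
                self.W2[j] -= lr * gW2[j]
                self.b1[j] -= lr * gb1[j]
                for i in range(n_in):
                    self.W1[j][i] -= lr * gW1[j][i]

# Off-line training data: a parameter grid much finer than the model set.
grid = [0.7 + 0.2 * n / 39 for n in range(40)]
raw = [gamma(t) for t in grid]
mu = [sum(c) / len(c) for c in zip(*raw)]
sd = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
      for c, m in zip(zip(*raw), mu)]

def scale(v):
    """Standardize the measures of fit before they enter the network."""
    return [(vi - m) / s for vi, m, s in zip(v, mu, sd)]

X = [scale(v) for v in raw]
T = [(t - 0.8) / 0.1 for t in grid]           # targets scaled to [-1, 1]

net = TinyMLP(n_in=3, n_hidden=4)
initial_loss = net.loss(X, T)
net.train(X, T)
final_loss = net.loss(X, T)

# On-line use: map the measures of fit of an unseen parameter back to theta.
theta_hat = 0.8 + 0.1 * net.forward(scale(gamma(0.83)))[1]
```

Only the evaluation of `gamma` (the filter bank) and one forward pass are needed on line; the training loop, which dominates the cost, runs beforehand. How closely `theta_hat` matches the true value depends on the network size and training effort, which are not tuned here.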
The proposed estimation procedure, as depicted in Figure 2, is thus as follows:
• $N$ estimators are designed for different values of $\theta$
• For a fixed amount of time (up to time $k$), these estimators are run in parallel to produce the measures of fit $V_j(y^k)$
• These measures of fit are used by the neural network to obtain the estimate $\hat{\theta}$
• A single estimator designed for $\hat{\theta}$ is used in the remainder of the run. This estimator can be initialized using the states and error covariances of the estimator in the multiple model bank with the smallest value of $V_j(y^k)$, or the data collected up to the current time can be run through the estimator.

Fig. 2. Structure of combined estimator (multiple models, neural network, single estimator for final state estimates)

In the more general case when the input $u$ and initial conditions $x_0$ are not known ahead of time, the trajectory $y^k$ must be considered a function of them as well. Let the input and initial conditions be parameterized in a suitable form, for example

$$u(t) = \sum_{i=1}^{M} \alpha_i u^i(t), \qquad x_0 = \sum_{i=1}^{L} \beta_i x_0^i$$

where $u^i(t)$ and $x_0^i$ are fixed and known. Let $\alpha = [\alpha_1 \ldots \alpha_M]'$ and $\beta = [\beta_1 \ldots \beta_L]'$. Then the mapping becomes $\Gamma(\theta, \alpha, \beta)$, and $N \geq d + M + L$ is required for inversion.

3. RESULTS

This technique was applied to the reflectometry estimation problem described above. As described in (Vincent et al., 1997b), the etching/reflectometry system can be modeled with simple linear dynamics and a nonlinear output equation. In this model, $er$ is the nominal etch rate of the RIE, $\delta er$ is the etch rate drift due to disturbances, and $d$ is the film thickness. Note that the nominal etch rate has been modeled as a first-order response. The input $u$ is the forward power to the RIE, and is usually held fixed through the etch, thus the input will be a step. The constants $a$ and $b$ can be fit to match the nominal response of the RIE. The reflectometry output $y$ is the amount of reflected light, which is a function of the material stack properties $\theta$.

Consider an etch of amorphous silicon (a-Si) on a silicon nitride/tantalum (SiN$_x$/Ta) stack with an initial a-Si thickness of 900 Å and a desired endpoint a-Si thickness of 500 Å (see Figure 3). In this case, $d$ is the thickness of the a-Si layer, while $\theta$ is the thickness of the SiN$_x$ layer.

Fig. 3. Experimental stack to be etched (a-Si on SiN$_x$ on Ta)

The operation of the multiple model estimator was as follows: since the goal was to achieve end point at 500 Å, a decision as to the true SiN$_x$ thickness was made with 600 Å remaining. The best estimate of remaining film thickness (the film thickness estimate from that estimator with the smallest value of $V_j(y^k)$) was used to determine when 600 Å was left. To account for variations in etch rate, $V_j(y^k)$ was calculated as follows:
$$V_j(y^k) = \frac{1}{k_{600} - k_{800}} \sum_{i=k_{800}}^{k_{600}} (y_i - \hat{y}_i^j)'\,(C_i^j)^{-1}\,(y_i - \hat{y}_i^j)$$

where $k_{800}$ is the sample point at which 800 Å is estimated to be remaining, and $k_{600}$ is the sample point at which 600 Å is estimated to be remaining. The multiple model estimator was applied with 3 models. The models had silicon nitride thicknesses of 2800, 2900, and 3000 Å respectively. A neural network was trained on 400 simulated etches. In the simulated etches, the silicon nitride thickness was varied uniformly between 2800 and 3000 Å, the initial amorphous silicon thickness was varied between 875 and 975 Å, and the steady state etch rate was varied between 4.5 and 5.5 Å/s. Measurement noise was also added. More details on the simulation of in-situ reflectometry can be found in (Vincent et al., 1997b). The 3 input/single output neural network had 2 hidden layers of 2 nodes each. The key idea is that at 600 Å remaining, the true silicon nitride thickness is estimated, and this value will be used in the last part of the etch with a single-model estimator for accurate endpoint.

The neural network was validated using 50 additional simulated etches with similar variations in initial conditions, and the results are shown in Figure 4. Note that the neural network is able to recover the true silicon nitride thickness quite accurately for silicon nitride thicknesses between 2800 and 3000 Å, even though the multiple model estimator contained film stack models only for the discrete values of 2800 Å, 2900 Å and 3000 Å. The mean squared error was 0.6 Å and the maximum error was 13.8 Å.

Fig. 4. Validation Results

This estimator was also validated using experimental data for the etch described above. These data are from experiments previously reported in (Vincent et al., 1997a), where a multiple model estimator was used with 3 models with SiN$_x$ thicknesses of 2850, 2900 and 2950 Å, but without interpolation, so the SiN$_x$ estimate could only be rounded to the nearest 50 Å. In Figure 5, the SiN$_x$ estimate using the same neural network interpolator is shown, along with an independent measurement of SiN$_x$ thickness obtained using spectral ellipsometry (SE). The results are quite good, with a mean squared error of 3.8 Å and a maximum error of 16.4 Å.

Fig. 5. Experimental Results

4. CONCLUSION

An adaptive estimation method combining multiple model estimators and a neural network was presented. This method used initial data and a multiple model estimator to obtain measures of fit which were input to a neural network. Because of the structure of the motivating problem, the adaptive estimation technique consisted of a system identification phase followed by an estimation phase. Thus the main contribution was developing an identification method which was compatible with the speed and computational requirements of on-line, real-time estimation.

A motivating problem was described and used to validate the method. This problem was to determine from reflectometry data the amount of remaining amorphous silicon film on a wafer during etching in the face of uncertainties in the silicon nitride/tantalum stack. The combined estimator performed quite well, estimating the uncertain silicon nitride thickness with a mean squared error of 3.8 Å. However, there were the following fortuitous aspects: only a single parameter was unknown, the input was fixed, and the state dimension was small. This allowed the use of only 3 estimators, and training could be accomplished with 400 simulated runs.

This paper only introduced this particular method of combining multiple model estimators with a neural network, and many open questions remain. In particular, it would be useful to determine under what conditions the mapping $\Gamma$ is locally invertible for a class of models. It would also be useful to determine those parameters to which $\Gamma$ is insensitive. For example, if the true system is stable, then the initial conditions will have an effect which decreases in time.
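For a specific model, both questions can at least be probed numerically. The Python sketch below is one possible check, under assumptions that are not part of the paper: a toy stable system x(k+1) = θ x(k) + 1 (a step input), y = x, three illustrative model values, unit output covariance, and open-loop model outputs standing in for estimator predictions. It forms a central-difference Jacobian of the measures of fit with respect to (θ, x0); a strictly positive smallest eigenvalue of J'J indicates full column rank, which is the usual sufficient condition for local invertibility, and the norm of the x0 column shows how the influence of the initial condition fades when the fit window starts later.

```python
import math

THETAS = [0.7, 0.8, 0.9]   # coarse model set (illustrative values)
X0_NOM = 5.0               # nominal initial condition used by every model

def simulate(theta, x0, k_end):
    """Step response of the assumed model x(k+1) = theta*x(k) + 1, y = x."""
    x, ys = x0, []
    for _ in range(k_end):
        x = theta * x + 1.0
        ys.append(x)
    return ys

def gamma(theta, x0, k_start, k_end):
    """Measures of fit over samples k_start..k_end (unit output covariance)."""
    y = simulate(theta, x0, k_end)[k_start - 1:]
    fits = []
    for tj in THETAS:
        yj = simulate(tj, X0_NOM, k_end)[k_start - 1:]
        fits.append(sum((a - b) ** 2 for a, b in zip(y, yj)))
    return fits

def jacobian(theta, x0, k_start, k_end, h=1e-5):
    """Central-difference Jacobian columns with respect to theta and x0."""
    cols = []
    for d_theta, d_x0 in ((h, 0.0), (0.0, h)):
        plus = gamma(theta + d_theta, x0 + d_x0, k_start, k_end)
        minus = gamma(theta - d_theta, x0 - d_x0, k_start, k_end)
        cols.append([(p - m) / (2.0 * h) for p, m in zip(plus, minus)])
    return cols            # cols[0] = d(Gamma)/d(theta), cols[1] = d(Gamma)/d(x0)

def min_eig_jtj(cols):
    """Smallest eigenvalue of the symmetric 2x2 matrix J'J."""
    a = sum(v * v for v in cols[0])
    c = sum(v * v for v in cols[1])
    b = sum(u * v for u, v in zip(cols[0], cols[1]))
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Same operating point, two different fit windows.
early = jacobian(0.83, 5.0, k_start=1, k_end=20)
late = jacobian(0.83, 5.0, k_start=30, k_end=50)

lam_min_early = min_eig_jtj(early)   # > 0 suggests (theta, x0) locally recoverable
x0_sens_early = norm(early[1])       # sensitivity to x0 with an early window
x0_sens_late = norm(late[1])         # sensitivity to x0 once transients have decayed
```

A parameter whose Jacobian column is close to zero over the chosen window, as happens for x0 in the late window of this toy example, is a candidate for being dropped from the inversion, which in turn reduces the number of estimators required.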
REFERENCES

Baram, Yoram and Nils R. Sandell Jr. (1978). An information theoretic approach to dynamical systems modeling and identification. IEEE Trans. Aut. Cont. AC-23(1), 61-66.
Fisher, William A. and Herbert E. Rauch (1994). Augmentation of an extended Kalman filter with a neural network. In: Proc. of IEEE Conf. on Neural Networks, Orlando FL. pp. 1191-1196.
Griffin Jr., G. C. and P. Maybeck (1995). MMAE/MMAC techniques applied to large space structure bending with multiple uncertain parameters. In: Proc. 34th Conference on Decision and Control. pp. 1153-1158.
Krauss, A. F. and E. W. Kamen (1996). A multiple model approach to process control in electronics manufacturing. In: Proc. IEEE/CPMT International Electronics Manufacturing Technology Symposium. pp. 455-461.
Lane, D. W. and P. S. Maybeck (1994). Multiple model adaptive estimation applied to the LAMBDA URV for failure detection and identification. In: Proc. 33rd Conference on Decision and Control. pp. 678-683.
Magill, D. T. (1965). Optimal adaptive estimation of sampled stochastic processes. IEEE Trans. Aut. Cont. AC-10(4), 434-439.
Sheldon, S. N. and P. S. Maybeck (1993). An optimizing design strategy for multiple model adaptive estimation and control. IEEE Trans. Automat. Contr. 38(4), 651-654.
Vincent, T. L., P. I. Klemicky, W. Sun, P. P. Khargonekar and F. L. Terry Jr. (1997a). A highly accurate end point method for a TFT back channel recess etch. In: Proceedings of the 1997 International Display Research Conference.
Vincent, T. L., P. P. Khargonekar and F. L. Terry, Jr. (1996). An Extended Kalman Filter based method for fast in-situ etch rate measurements. In: Diagnostic Techniques for Semiconductor Materials Processing II: Symposium held November 27-30, 1995, Boston, MA (S. W. Pang, O. J. Glembocki, F. H. Pollak, F. G. Celii and C. M. Sotomayor Torres, Eds.). Materials Research Society, Pittsburgh, PA. pp. 87-94.
Vincent, Tyrone L., Pramod P. Khargonekar and F. L. Terry, Jr. (1997b). End point and etch rate control using dual wavelength reflectometry with a nonlinear estimator. J. Electrochem. Soc. 144(7), 2467-2472.