A strategy for simultaneous dynamic data reconciliation and outlier detection


Computers chem. Engng Vol. 22, No. 4/5, pp. 559-562, 1998. Copyright © 1998 Elsevier Science Ltd. All rights reserved. Printed in Great Britain. PII: S0098-1354(97)00233-0. 0098-1354/98 $19.00+0.00

Pergamon

J. Chen and J. A. Romagnoli*

ICI Laboratory of Process Systems Engineering, Chemical Engineering Department, University of Sydney, Sydney, NSW 2006, Australia

(Received 3 December 1996; revised 5 May 1997)

Abstract

The presence of outliers corrupts the procedure of dynamic data reconciliation. In this note, a cluster analysis technique is suggested as a way of discriminating between outliers and normal observations. Furthermore, the formulation of the dynamic data reconciliation problem is modified to incorporate the outlier information. In this way, dynamic reconciliation can be carried out simultaneously with outlier detection. The performance of the proposed approach is demonstrated by simulations on a chemical engineering example from the literature. © 1998 Elsevier Science Ltd. All rights reserved

Keywords: dynamic data reconciliation; outlier detection; cluster analysis

Notation

DIST_i = Minimum distance between measurement y_i and any other measurement in the moving window
f = Differential equation constraints
g = Inequality constraints, including simple upper and lower bounds
h = Algebraic equality constraints
r_M = Mean minimum distance
y = Discrete measurements
ŷ(t) = Estimated measurements
V = Covariance matrix
V_k = The kth diagonal element of the covariance matrix V
w = Trust degree of y with respect to the main body of the data

Greek letter

Φ = Objective function

Acronyms

MMD = Mean minimum distance

1. Introduction

Reliable process data are the key to efficient control and operation of chemical plants, but measured process data often contain noise and are frequently contaminated by degradation, human error, or other unmeasured disturbances. Implementing data reconciliation techniques can improve the understanding of a process by providing optimal estimates of flowrates, temperatures, concentrations, etc., which helps to achieve better quality control and increase profits. Generally, model-based data reconciliation procedures require the absence of gross errors; otherwise the reconciled values will exhibit "smearing" when compared with the true values (McBrayer and Edgar, 1995).

Gross errors can be divided into two classes: (1) measurement bias and (2) outliers. Bias refers to the situation in which the measured values are consistently too high or too low. Outliers, by contrast, may be regarded as abnormal behaviour of the measured values, for example process peaks or unmeasured disturbances.

Several well-developed gross error detection strategies are now available for steady-state situations. An excellent survey of recent developments in this area is provided by Crowe (1996). Generally, steady-state gross error detection is set up as a hypothesis test. The null hypothesis is that the measurement errors are normally distributed with zero mean, which is the common assumption on measurement noise. However, there are some difficulties in extending steady-state gross error detection schemes to the dynamic situation. For instance, the expected value of a measurement does not exist. Finding a reliable and efficient way to deal with gross errors is still a challenge facing dynamic data reconciliation. In this paper, we present a method for dealing simultaneously with the dynamic data reconciliation and outlier detection problems (in terms of eliminating the influence of outliers).

* Author to whom all correspondence should be addressed. Fax: +61 2 9351 2854. E-mail: [email protected].

2. Outlier Identification

Mathematically, data reconciliation is defined as the optimal solution to a constrained least squares problem. Because least squares is sensitive to large errors, one or two outliers can sometimes corrupt the whole procedure and yield misleading results. It is therefore essential to detect outliers and bound their influence in data reconciliation.

An outlier is, by definition, a measurement in which the error does not follow the statistical distribution of the bulk of the data. Normally, outliers are a small fraction of the whole data set and have little or no relation to its main structure. It is therefore possible to distinguish outliers from normal data by comparing their behaviour. Specifically, we suggest using cluster analysis as a method of discriminating between outliers and the main structure of the data. This has the advantage that no a priori assumption about the process measurements has to be made.

Generally, cluster analysis seeks to divide a set of samples or objects into several groups or clusters. Objects within the same group are more similar to each other than to objects in different groups. When the process is in a dynamic state, an elongated cluster is considered a suitable candidate for describing the process behaviour when outliers are absent. A typical example of an elongated cluster is shown in Fig. 1. In a real process, as opposed to a simulation, measurements should change in a continuous, consistent and relevant way; outliers, in contrast, are abnormal observations unrelated to this pattern. The way to identify outliers is to look for an elongated cluster that conforms to the normal behaviour of the measurement. Any data points that do not belong to the underlying elongated cluster are then considered outliers.

In our case, we use distance as the similarity measure. Unlike the popular C-means algorithm, which assigns each object to the cluster with the closest cluster centre, we instead assign each object to the cluster of its nearest neighbour within a certain distance (Yin and Chen, 1994). The criterion for identifying the cluster is the mean minimum distance (MMD), which is the mean distance from one object to its nearest neighbour. Given a set of N objects y_1, y_2, ..., y_N in a d-dimensional space, the MMD r_M is defined as

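$$
r_M = \frac{1}{N} \sum_{i=1}^{N} \min_{j \ne i} \left[ \sum_{k=1}^{d} (y_{ik} - y_{jk})^2 \right]^{1/2} \tag{1a}
$$

where y_{ik} denotes the kth component of object y_i.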
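As an illustration of (1a), the following minimal Python sketch computes the nearest-neighbour distance of every object in a window and then averages these distances. The function name `mean_minimum_distance` and the sample window are hypothetical and chosen only for this example; the Euclidean distance is assumed as the similarity measure.

```python
import math


def mean_minimum_distance(points):
    """Return (r_M, dist), where dist[i] is the Euclidean distance from
    points[i] to its nearest neighbour and r_M is the mean of dist."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are needed")
    dist = []
    for i, p in enumerate(points):
        # Nearest neighbour of p among all other points in the window.
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dist.append(nearest)
    return sum(dist) / n, dist


# Hypothetical 2-D window: five points on a line plus one isolated point.
window = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0),
          (2.0, 10.0)]
r_m, dist = mean_minimum_distance(window)
print(r_m, dist)
```

In this window each of the five collinear points has its nearest neighbour at distance 1 and the isolated point has its nearest neighbour at distance 10, so r_M = 15/6 = 2.5.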

In practical situations, the variability of individual measurements may differ; for example, a flowrate measurement is noisier than a temperature measurement. As a result, outliers in a smoother variable may be hidden within the normal variation of noisier variables. To avoid this, each variable should be weighted by its own variance. Thus (1a) is rewritten as:

$$
r_M = \frac{1}{N} \sum_{i=1}^{N} \min_{j \ne i} \left[ \sum_{k=1}^{d} \frac{(y_{ik} - y_{jk})^2}{V_k} \right]^{1/2} \tag{1b}
$$

where V_k = the kth diagonal element of the covariance matrix V.

The idea of a moving window is used to capture the latest process behaviour. The length of the window is a tuning parameter: if it is too long, a lag is introduced; if it is too short, outliers are tolerated.

3. Simultaneous dynamic data reconciliation and outlier detection

Non-linear dynamic data reconciliation was recently investigated by Edgar and his co-workers (Kim et al., 1991; Liebman et al., 1992). They have demonstrated the advantages of using non-linear programming techniques for dynamic data reconciliation. This technique can efficiently perform computations for both linear and non-linear models. The dynamic data reconciliation problem can be written as

$$
\min_{\hat{y}(t)} \Phi = \sum_{i=0} \frac{1}{2} \left[ \hat{y}(t_i) - y_i \right]^T V^{-1} \left[ \hat{y}(t_i) - y_i \right] \tag{2}
$$

subject to:

$$
f\left[ \frac{d\hat{y}(t)}{dt}, \hat{y}(t) \right] = 0, \qquad h[\hat{y}(t)] = 0, \qquad g[\hat{y}(t)] \ge 0
$$

Fig. 1. Three clusters in a two-dimensional space: A, C are noise/outlier clusters, B is an elongated cluster.

where ŷ(t) = estimated measurements, y = discrete measurements, V = covariance matrix, f = differential equation constraints, h = algebraic equality constraints, and g = inequality constraints including simple upper and lower bounds. In order to incorporate outlier information into the data reconciliation procedure, we modify the objective function of dynamic data reconciliation as

$$
\min_{\hat{y}(t)} \Phi = \sum_{i=0} \frac{1}{2} \left\{ w_i \left[ \hat{y}(t_i) - y_i \right] \right\}^T V^{-1} \left\{ w_i \left[ \hat{y}(t_i) - y_i \right] \right\} \tag{3}
$$

where w = trust degree of y with respect to the main body of the data. The trust function expresses the degree to which each individual measurement belongs to its own main structure. w_i is defined as:

$$
w_i = \begin{cases} 1 & \text{if } \mathrm{DIST}_i \le 2 r_M \\[4pt] \dfrac{2 r_M}{\mathrm{DIST}_i} & \text{if } \mathrm{DIST}_i > 2 r_M \end{cases} \tag{4}
$$

where DIST_i = minimum distance between measurement y_i and any other measurement in the moving window. In effect, the new objective function (3) allows the optimiser more freedom in adjusting the estimates of suspected measurements (outliers). In this way, the influence of outliers is suppressed and a more reliable result can be expected.
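The trust function (4) and the weighted objective (3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the covariance matrix V is taken to be diagonal and is passed as a list of variances, distances are scaled by these variances as described in Section 2, and the names `scaled_distance`, `trust_weights` and `weighted_objective`, as well as the sample data, are hypothetical.

```python
import math


def scaled_distance(p, q, variances):
    """Distance between two measurement vectors, each squared difference
    divided by the variance of that variable (diagonal V assumed)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(p, q, variances)))


def trust_weights(points, variances):
    """Return (weights, dist, r_m) for all points in the window.
    dist[i] is DIST_i, r_m is the mean minimum distance, and weights[i]
    is 1 if DIST_i <= 2 r_M, otherwise 2 r_M / DIST_i."""
    dist = [min(scaled_distance(p, q, variances)
                for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]
    r_m = sum(dist) / len(dist)
    weights = [1.0 if d <= 2.0 * r_m else 2.0 * r_m / d for d in dist]
    return weights, dist, r_m


def weighted_objective(estimates, measurements, weights, variances):
    """Sum over samples of 0.5 * {w_i [yhat_i - y_i]}^T V^-1 {w_i [yhat_i - y_i]}
    for a diagonal V."""
    total = 0.0
    for y_hat, y, w in zip(estimates, measurements, weights):
        total += 0.5 * sum((w * (a - b)) ** 2 / v
                           for a, b, v in zip(y_hat, y, variances))
    return total


# Hypothetical window: five consistent measurements and one suspect point.
measurements = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0),
                (2.0, 10.0)]
variances = [1.0, 4.0]  # assumed diagonal of V
weights, dist, r_m = trust_weights(measurements, variances)

# Candidate estimates that pull the suspect point back onto the line.
estimates = measurements[:5] + [(2.0, 0.0)]
phi_plain = weighted_objective(estimates, measurements, [1.0] * 6, variances)
phi_trust = weighted_objective(estimates, measurements, weights, variances)
print(weights, phi_plain, phi_trust)
```

With these numbers the suspect point receives a weight of 2/3 while all other points keep a weight of 1, so moving its estimate away from the measured value costs less in the weighted objective than in the unweighted one. In such a short window the suspect point itself inflates r_M noticeably, which is one reason the window length matters.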

4. Illustration example

The performance of the proposed method has been tested on the same CSTR system used by Edgar and his co-workers (Kim et al., 1991; Liebman et al., 1992; McBrayer and Edgar, 1995). There are four variables in the system and all of them are assumed to be measured. The two input variables are feed concentration and feed temperature, while the two state variables are output concentration and output temperature. Measurements for both state and input variables were simulated at time steps of 2.5 s by adding outliers and Gaussian noise with a standard deviation of 5% to the "true" values obtained through numerical integration of the dynamic state equations. The number of outliers equals 10% of the total data. The CSTR simulation was initialised at a steady-state operating point of feed concentration = 6.5, feed temperature = 3.5, output concentration = 0.1531 and output temperature = 4.6091. At time step 30, the feed concentration was stepped from 6.5 to 7.5.

In our study, the simultaneous dynamic data reconciliation and outlier detection problem is solved through a sequential strategy. In this approach the dynamic model equations are solved at every iteration of the optimiser using a differential equation solver embedded within the optimisation strategy.

The estimated values for the state variables are shown in Figs 2 and 3. Figures 4 and 5 show the estimates for the inputs. Please note that Figs 3(b) and 4(b) are a rescaling of Figs 3(a) and 4(a) to show the dynamics in more detail. In these figures the circles correspond to the measurements, the dotted line in Figs 3(b) and 4(b) to the true data (free of noise), and the full line to the estimates of the measurements. Also included in the figures are crosses, which are the estimates of the process variables obtained using the conventional approach.

Fig. 2. Concentration estimate response to step change in feed concentration.

Fig. 3. Temperature estimate response to step change in feed concentration.

The result of applying conventional data reconciliation is, in essence, to spread one large error (outlier) over all of the measurements in the window, i.e. one outlier affects all measurements, as indicated in these figures. It is clear from these results that the presence of outliers degrades the performance of the reconciliation procedure. Furthermore, when outliers are present in the data set, the convergence time of the algorithm increases drastically for the conventional approach compared with the proposed strategy. The new approach gives a good result, as expected, since the influence of outliers is taken into account by the trust function w. It is worth mentioning that a moving window strategy, recommended by Kim et al. (1991) and Liebman et al. (1992) for improving the performance of the optimisation, is adopted in this work with very good results.
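The sequential strategy and the smearing effect can be illustrated on a much smaller problem than the CSTR. The sketch below is not the authors' implementation: it assumes a toy first-order model dx/dt = -K x in place of the CSTR equations (which are not given here), a fourth-order Runge-Kutta integrator as the embedded differential equation solver, and a golden-section search in place of a non-linear programming solver. The only decision variable is the initial state x0, and all names and numerical values are hypothetical.

```python
import math

K = 0.5    # assumed decay rate of the toy model dx/dt = -K * x
DT = 0.5   # assumed sampling interval
N = 11     # assumed number of samples in the window


def simulate(x0):
    """Integrate the toy model with one RK4 step per sample; return N states."""
    def f(x):
        return -K * x
    xs, x = [x0], x0
    for _ in range(N - 1):
        k1 = f(x)
        k2 = f(x + 0.5 * DT * k1)
        k3 = f(x + 0.5 * DT * k2)
        k4 = f(x + DT * k3)
        x += DT * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        xs.append(x)
    return xs


def trust_weights(y):
    """Trust weights for scalar measurements: 1 if DIST_i <= 2 r_M,
    otherwise 2 r_M / DIST_i."""
    dist = [min(abs(a - b) for j, b in enumerate(y) if j != i)
            for i, a in enumerate(y)]
    r_m = sum(dist) / len(dist)
    return [1.0 if d <= 2.0 * r_m else 2.0 * r_m / d for d in dist]


def objective(x0, y, w, var=1.0):
    """Weighted least squares objective; the model is re-integrated
    on every call, as in a sequential strategy."""
    return sum(0.5 * (wi * (xi - yi)) ** 2 / var
               for xi, yi, wi in zip(simulate(x0), y, w))


def golden_section_minimise(func, lo, hi, tol=1e-8):
    """Minimise a unimodal scalar function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return 0.5 * (a + b)


# Noise-free measurements of the true trajectory (x0 = 2) with one outlier.
y = [2.0 * math.exp(-K * DT * i) for i in range(N)]
y[4] += 3.0

ones = [1.0] * N
w = trust_weights(y)
x0_conventional = golden_section_minimise(lambda x0: objective(x0, y, ones), 0.0, 5.0)
x0_trust = golden_section_minimise(lambda x0: objective(x0, y, w), 0.0, 5.0)
print(x0_conventional, x0_trust)
```

In this toy case the single outlier shifts the conventional estimate of the initial state, and hence the whole estimated trajectory, away from the true value of 2, whereas the estimate obtained with the trust weights stays much closer to it.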

Fig. 4. Step change in feed concentration.

Fig. 5. Feed temperature estimate response to step change in feed concentration.

5. Conclusion

In this paper a method for simultaneously performing dynamic data reconciliation and outlier detection is presented, based on a combination of cluster analysis techniques and dynamic optimisation. A case study of a CSTR commonly used in the literature serves to show the behaviour of the proposed approach. Recently, a method for bias detection in dynamic situations was suggested by McBrayer and Edgar (1995). In future work we intend to integrate their approach with the method presented in this paper, which may lead to a complete gross error detection strategy.

References

Crowe, C. M. (1996) Data reconciliation--progress and challenges. J. Proc. Control 6(2/3), 89-98.

Kim, I.-W., Liebman, M. J. and Edgar, T. F. (1991) A sequential error-in-variables method for nonlinear dynamic systems. Computers chem. Engng 15(9), 663-670.

Liebman, M. J., Edgar, T. F. and Lasdon, L. S. (1992) Efficient data reconciliation and estimation for dynamic processes using nonlinear programming techniques. Computers chem. Engng 16(10/11), 963-986.

McBrayer, K. F. and Edgar, T. F. (1995) Bias detection and estimation in dynamic data reconciliation. J. Proc. Control 5(4), 285-289.

Yin, P. and Chen, L. (1994) A new non-iterative approach for clustering. Pattern Recognition Letters 15, 125-133.