IDENTIFICATION OF HETEROSCEDASTIC PLANTS N. S. Rajbman and F. A. Ovsepian Institute of Control Sciences, Moscow, USSR
ABSTRACT In solving the plant identification problem it is often the case that the conditional variance of the output variable Y with the input variable X changes.
tion of the input variable X and the output variable Y. 'l'hus a linear regression and constant conditional variance of Y with respect to X are obtained. The use of dispersion techniques for random functions does not employ the Gaussian joint distribution of Y and X, but the condition of constance for conditional variance should be met. It is easy to show that efficiency of the dispersion identification technique depends on the degree of the conditional variance
In this case the identification methods are unwieldy and ineffective. The paper studies the he teroscedastic class of plants with conditional variance of Y as a function of X. In the beginning the conditions for heteroscedasticity to arise and the incidence of such plants are described; then the quantitative estimate of the degree to which the conditional variance is changed is introduced and identification algorithms for heteroscedastic plants are described.
CJJ(Y/x ) ~ MU ~- M(1'~)Y}
change, v/here 11 and D are the mean and the variance, r.l (y/x) is the conditional mathematical expectation of Y with respect to X. Identification accuracy with D being a function of X decreases owing to the following reasons.
KEYWORDS Identification Parameter estimation Probability Monte-Carlo methods Sampled data systems Heteroscedasticity
Firstly, the dispersion is greater for model parameter estimates which are built around the least squares technique (LST) with the conditional variance D(y/x) neglected. Secondly, the construction of models in the form of conditional mathematical expectation may be insufficient for solving the control problems which necessitates building another model for the same plant employing the conditional variance.
INTRODUCTION AND STATEMENT OF THE PROBLEM The large scale of research carried out in many countries in identification theory and applications (Astrom 1970, Eykhoff 1974,Isermann 1974, Pugachev, Kazakov, Jevlanov 1974, Tzypkin 1970, Voronov 1970, Sage, Melsa 1911, Raj bman , Chadeev 1975, Rajbman 1976, Proceedings 1976) have resulted in a significant number of efficient techniques for constructing the models and estimating the parameters of a wide class of control plants. Some of these techniques employ correlation (for linear models) and dispersion (for nonlinear models) random functions characteristics. The use of dispersion functions made it possible to estimate quantitatively the degree of nonlinearity and to obtain identification equation for a nonlinear plant from input/output data.
As an example consider a plant with real stationary random functions X(t) and yet) as input and output, respectively. Two-dimensional density of distribution of the random variables X=X(t) and Y=Y(s) is in the form
P
I('XI~)= 'j.
l-d.
~ J$, 1/1- Z~d
exp
(xt-~Z.x.,x,~+~~7+ a 0
f
.. exp [ - :'(1- z.~J
~
[_
J9,
C;Z
1
:t(1-rel.) If
I i-ex.'l::c~ I:;
x
(1)
K
(x2.+lz xl/-I-~)l xd d <1' J
where ~ is the distribution parameter taking its values within the closed interval [0,1] ,"t:x.:J denotes a parameter which \'Ii th a:, =1 or signifies the correlation coefficient
Applicability of identification correlation methods requires the use of the Gaussian joint distribu-
°
J 935
1936
N. S. Rajbman and F. A. Ovsepian
of the random values Y and X. The distribution (1) is a particular type of distribution widely used in the I.1arkov proc esses theory (Sarmanov, 1960). The relationships M(y/x) and D(y/x) for the plant under consideration are of the form
with (i=0 or the joint distribution of X and Y is Gaussian and consequently D(y/x)=const; the identification problem is then solved through the correlation techniques for random functions. With other values of a.- not only does the di stribution (1) differ from the normal, but D(y/x) becomes dependent on X and identification techniques work unsatisfactorily. In the extreme case, i.e. with oG =1/2, the model of this plant cannot be built since M(y/x)=O although the control problems may be solved with the use of the D(y/x) model. The class of plants for which the conditional variance D(y/x) is the function of X is referred to as h eteroscedastic. The output variable Y of the plants of this class may be presented as
where t,(x) is noise uncorrelated with X but with the conditional variance depending on X. Heteroscedastic plants may be real-world plants such as Earth magnetizm field intensity with respect to Earthreferenced coordinates or may reflect the productive activity of man in various fields of technology, for instance, in chemistry, the petroleum industry, petrochemistry,metallurgy, etc. These include the processes of surface-active sUbstance dissolution, diffusion in solid body, hydration of oil aldehydes, hydrorefining of diesel fuels and kerosine, catalytic reforming of gasoline fractions, oxygen convertor melting, etc. Generally any technological process affected by some uncontrollable disturbances such as wear of installations tools, formation of scale in heat exchangers, changes in raw material composition, meteorological conditions, etc. may be presented as a heteroscedastic plant (Rajbman 1961, Hummelblau 1970, Ovsepian, Vardanian 1973,
Ovsepian, Simsarian 1974, Bunich, Ovsepian 1916). In industry plants may be regarded as he teroscedastic as a result of the use of conventional measuring instruments. In some cases when these instruments differ in accuracy, although they deal with data on the same process heteroscedasticity may be recognized in advance. In other cases the instruments used are a priori of the same type but may show individual features intrinsic to them due to different conditions of storage and maintenance in the past, latent imperfectiOns, etc. Typical experimental data on chemical reactors contain the dependent variables which are the decreasing functions of range or time. While the measurement error is kept constant the relative error of the dependent variable increases with the independent variable. It is easy to show that the input variable measurement error as well as the nonlinear transformation of homoscedastic random functions may result in heteroscedastic independence between Y and X (Bunich, Ovsepian, 1916). Heteroscedastic plants can appear in scientific research due to the application of nonlinearization techniques. Thus the reasons why pl ants may be treated as het eroscedastic are classified into four major groups: 1. Specifics of the physical and chemical processes, 2. Nonlinear transformations of random functions, 3. Errors of measuring instruments, 4. Approximation techniques for control plant study. As has already been mentioned, in order to give a satisfactory solution to control problems for such plants two models M(y/x) and D(y/x) should be built for the same plant. Statistically 1ST identification of heteroscedastic plants requires the selection of the D(y/x) model structure by experimental data. Criteria of Eodel adequacy estimation known through the literature do not fit in this case even if Y is normally distributed with X fixed. The present paper treats the problems of D(y/x) measurement, obtaining the statistics of D(y/x) restoration through experimental data and the study of their sample characteristics and application of these in identification of heteroscedastic plants.
Identification of Heteroscedastic Plants
CONDITIONAL VARIANCE MEASURBMENT Consider two continuous dependent random variables in a rect~lar area .Jl. =[Q.~x...6:~) C ~~!::. dJ which may be infinite. X and Y are interrelated thro~h the density probability f(x,y), and one-dimentional densities are denoted as f1 (x) and f 2 (y). Assume that the random variable Y has only the first four moments, the first conditional moment bei~ M(y/x)=const and cL
~i(X):::
J!cx,lIj) ~ >
0 .
(5)
c. Let us show that in this case the measurement of the function D(y/x) is equivalent to the following variational problem. Among the set of feasible functions {
eJ
Clt
g
J(lex) /4 (x) J,x '" 1 a,..
respecti¥ely find the function maximizing the value of the integral form
Using the Euler-Lagrangian equation obtain the function providing the extremum of the functional (7) in the class of functions satisfying the condition (6) in the following form: (8)
With due regard for the original assumption of M(y/x)=const (9) may be rewritten as
r-q:;-r-9)-c-:-:}jI-=-'x):-:;"'L-r
(11)
'iiby:t Since flex) > 0 as follows from (5) it is pain that ;j =0 if and only if D(y/x)=const, i.e. ~ serves as a measure for the conditional variance D(y/x) change. It is this particular feature that is used in the statistical criterion of the function D(y/x) restoration by experimental data described in the paper. Reference (7 ) contains a result similar to (11) in a rather different form. A scalar or vector random function may be used in (11) as the variable X. If in this case yet) is also a random function, .rf is a function which measures the relationship between Y and X resulting from the change of D(y/x) of the heteroscedastic plant under study. If Y and X are coordinated of the same random functions at different times, than :J measures the inherent heteroscedastic properties of this random function. To solve the identification problem a sample value of J should-be computed from the experimental data while to find out whether the plant under study is homoscedastic or heteroscedastic class the sample distribution of :J should obviously be known. These problems are considered below. SPJ;;PLE
STATISTIC~ OF c.):I~ AND OF ,,}'iXFor compatibility of some computational procedures and convenience of identification nroblem statement introduce two theoretical characteristics of the random variables Y and X using";-~~--:---. (11):
W
Using (8) we may write
1937
::
d?C
4
;t) [1) Ufjx)J
( 1 2)
~y"
when ri' (y/x)=const
ancl_ (13 )
Since the integral form (7) is a correlation coefficient between the normalized and centered random variabIes \1:: ~2-fi(.!1~and P(x) and that
J'?- 0 write
d
JiJy,';:
as followa from (9) we may (10) O~.J£:i.
when D(y/x)=const. The value of Jyx and of CJ;pc ch::ll1C;es wJ. thin a range of [0,1] , which is easily seen from (9) and (10) with due regard for the relationship M(y2/ x ) = J,12(y/x) + D(y/x). The sample values of c.J~xapd J";J!X. are denoted as Wlj::lC and "J:J~ res-
N. S. Rajbman and F. A. Ovsepian
1938
pectively. The co~putation of point estimates of the statistics WljXoo and .;};pc. from experimental data employs a general technique used in mat~ematical statistics, the essence of which is in substitution the theoretical moment characteristics with their sample values. In formulas below the computation W~x. and :(jiJ!L suggests that a) X is a nonrandom vector of dimensionality N the terms of which may take on q various values in the course of the experiment, and b) with the i-th combination of the components of the vector X the dependent variable Y takes on m. values so that 'mj, -= n, 1 ~
t
If it is random the vector X may be reduced to the above algorithm via grouping and substitution of the samples of the terms Xi (i=1 ,N) with the central points in associated grouping intervals. In this case q is the number of nonrepeating combinations composed of the samples of the components of vector X. ~ith due regard to th~ above, the statistics Wa-%.and iJoxare computed by the for~m~u~l:a~s~______~.-=- ~
~
where
;:1
_
1n..,
q.
(,~f - ~ h
s:'-
In. __.t)t__
*1 1J"i
(14)
are the measured values of Y. It is obvious that WlI ", and ~lfoc. con~erge in probability to W~x and ~~~ , i.e. the statistics suggested are consistent although,as will be shown below, they are positive-biased. T~e sample distributions
"
W~x
and may be accurately defined if one assumes that the general totality distribution, i.e. that of the output variable Y of the plant under ' study, is known. But even if this distribution is normal the problem of obtaining accurate distributions of the unknown statistics faces serious computational difficulties which necessitate good enough approximations of these distributions. If we recall the fact that usually the general totality distribution is known a priori it becomes plain that determination of accurate sample distributions of WlJ~ and ~ ~ is not necessary. ~~oc.
(j
Practically the identification problems always require finding an appropriate approximation of sample statistics distribution when the general totality distribution belongs to a certain class of distributions most widely used in experimental studies. This may be achieded, for instance, through finding certain moments of statistics of a not very high order and the use of them in constructing, say, one of the Pearson curves. This procedure waS used by the authors ~or the st~dy of sample statistics w)jx and ~:t:x:. undertaken together with A.A.Yaralov. Th~ moment characteristics w~xand ~~~ were obtained with the use of the statistical test technique which generated samples of different sizes out of the given class of general distributions. These moment characteristics were obtained by 500 data entries. It was shown that with relatively large An (n ~ 1.QO) the distribution W~f" and ~~x are very accurately approximated via normal distribution, i.e. these sample distributions do not depend much on the general totality distribution. This point is much more important than the conventional mathematical statistic criteria of uniformity test for variance homogeneuity, for instance, the Bartelett criterion and its modification which are very sensitive to the general totality distribution. To obtain the parameters of the Aq sample distributions w~oc and ~dx with respect to the sample volume n the same procedure of statistical tests was used. The general totality
1939
Id e ntification of He t e ros cedastic Plants
was sampled with M(y/x)=const and D(y/x)=const. The sample volume was between 100 and 1000, therefore the parameters below of the sample normal distribution may be used only within the given range of n. Verification of statistical hypothesis requires the consideration of algorithms for heteroscedastic plant identification and necessitates that the moment characteristics of the variable Y be associated with the general totality. The means U)~~ and J.~.,., and the root m~an square values G"c.> and ()~ of w!J:x.. and iJ'~'X-with respect to n(n=100-1000) will be,respectively, -
-3
c.J~r;t. :: 0,54 - 0,5·10/1.+ o,?dO
bw '::. -q
O,OG6 -
11'd'L :: 0,:; -
-, J,
n;
(16)
O)GS6.10-1i'7t.+O,35it·~(J1~7)
0, ?> Q·~O-
0 ')1,+
O,:t'10
-6
n
~
~~:: O,095_0,~",·w-3n+o,55.~o-1flt~
(18) (19)
The simulation results shol that the estimates CJ~", and ~l~ are positive-biased and with,n=1000 the bias ~ltx, =0,34 and ;JJJx, =0,12. With small n these biases are larger. IDENTIFICATION ALGORITHM Figure 1 shows the block-diagram of the identification algorithm. Its basic function is changing the statistical hypothesis wi1;.h the us~ of sample distributions w~x and ~3~· Three hypotheses are formulated. Hypothesis I. The plant under study is homoscedastic, i.e. D(y/x)=const. In the light of the above this hypothesis is verified under the condition of no correlation between Y and X, i.e. with M(y/x)=const. To provide this the initial array ~~ is centered in each grouping interval (~id ~ 'J1.J and the estimate ~d~P is computed by formula (14). W:1::t. with the given sample size n is computed by (16) and Ow by (17). Confidence interval with the confidence probability.fo is determined using the fact that. &J40c is normally distributed; i f w;:I'"f:I.ft the hypothesis is not rejected. Practically this hypothesis answers the question if a homoscedastic model is constructed for the plant under study.
9;..)
I;
Hypothesis 11. There is no correlation interrelationship between the
variables Y and X, i.e. M(y/x)=const, therefore a purely h cteroscedastic model D(y/x) may be constructed for the plant under study provided that hypothesis I was rejected. Verification of hypot hesis 11 depends upon the r e sults of the hypothesis I verification. Therefore two verification versions should be considered. Version I: hypothesis I is true,i.e. D(y/x)=const. In this case the exncrimental data ~i.J are centered . (~~~ ~~- ff) and the e stimate ~~t~p 1I':f"" _ is computed by formula (15). JlJX and (Y-D are determined with given n by formulas (18) asd (19) and a confifence ijt erval I~ is defined. If Co Ip. the hypothesis is not rejected. Version 11: hypothesis 11 is rejected, i.e. D(y/x)=const. In this case the experimental data ~~ are centered and normalized
'J;:r
G~:: (1jiJ-~)/j(it- (~J"/J and these data are used in estimating via formula (11). Further verification is done in the same manner as in version I.
:t>;:P
Verification of the first two hypotheses answers,with confidence, the question which type of model would fit the plant under study. If both hypotheses are not rejected the model cannot be built since statistical interdependence between Y and X, if any, cannot be described by the functions M(y/x) and D(y/x). If hypothesis I is not rejected the regression function M(y/x) may serve as the plant model and it may be restored by experimental data either by correlation and dispersion technique or by the methods suggested in the paper as a homoscedastic plant is but a particular case of the heteroscedastic one. To find out whether the correlation or dispersion techniques should be applied to the algorithm the well-known hypothesis is verified that the correlation interrelationship between Y and X is nonlinear (20) where ~~~ is the dispersion function of the random variables Y and X and ZM~ is the correlation function. \'/i th a nonlinear interrelationship between Y and X one may obtain the average degree of nonlinearity within the plant operation time from
N. S. Rajbman and F. A. Ovsepian
1940
data grouping, formation of the array
centering
•
f computing
~3':f
'Yl,j
interval centering
computing
verification of hypothesis II,version 11
A
computing ~~
e7f
w~~
verification of hypothesis I
no
centering
yes no
-
~es
selection of the functions Jl(.[\'I/~) and .Jl(.{.!Ir~ and estimation of their parameters
selection of the function M(Yt'x,land estimation of its parameters
no
r---
formation the array
yes
°le 1J i.j
of formation et the array ~'J
computing
~
identification by dispersion methods
~
Go)~-.x:
t
L-.,
verification of hypothesis III Ino --
computing
'JJlt.
.. verification
of hypothesis III yes
"
yes
hw
identification by correlation methods
selection of the a..-- function ./H,(9/2!J and estimating its parameters
f
model building of ~(!1/tx:)
forming of the array -:1.~ --
,
'J
building a mode 1 q) ( 111.2:)
computing
end
-.
~
estimation of the ~arameters JK{~%) on the knowledge of weighted method of least squares
'--no
~'I!:.
verification of hypothesis III yes
end 1
end Fig. 1. Block-diagram of the identification algorithm for heteroscedastic plants
yes
rerur
no verification of proposition of nonlinearity
fo.
t
cJ:i~
verification of hypothesis II,version I
Identification of Heteroscedastic Plants
t ciT
by the formula :i.~T
T ioJ
i
(21)
na:x.(-Z;) eire:.
Hypothesis Ill. The final,third, hypothesis in this algorithm blockdiagram verifies the fitness of the LST regression relationships obtained in the experimental data. Consider one general case of applying hypothesis III when the regressions M(y2/x ) and M(y/x) are adequacyverified. This is the case when hypotheses I and 11 are rejected, i.e. two models should be built for the plant, M(y/x) and D(y/x). Construction of the model D(y/x) in this case employs the formula
cJJ eifx.) = I JlL UI%) - JU/{ o/x) I
i.
The functions M(y/x) and M(y/x) are somehow chosen and their parameters LST-estimated. Experimentally obtained are centered and normalized 7
t/ .1t. -: L d
,.
t
it:...fU C*,xJ ) JU( y'-I IX;.
Aq, t /f/qX
and
• The values of
W~:x. and are computed via (14) and (15) J.n whi9h 1Ji and are substituted by ~l 'and 1/...1,1; , respectively. The confidence intervals and are determined and two conditions are verified simultaneously x f T f31Y1 and
It
"i
GJ~~ f-
rc.Jfb
I;:
CASE STUDY The characteristics discussed in the paper were used in identification of oxygen convertor melting completion. This completion takes place in the last 5-7 minutes of the blowing-off process. Control goal of melting is stopping the blowing-off at the moment when the content of carbon in the metal reaches the required level. Direct measurements of carbon content is difficult enough because of high temperatures, medium aggressiveness, great process properties changes in time, etc. Therefore the value C is computed through the decarbonization rate V by the formula
io
c-:ock+ci't/L V{-c)d:t
where the magnitude sign is necessary; with certain values of X the restored regression function M(y/x) may be smaller than M~(y/x), particularly with those values of X where there was a very small number of the observed values of Y. Practical and simulation study results have shown that although seldom enough this may be true in only 5-7 per cent of cases of the number of combinations considered. If only M(y/x) serves as the plant model, or only M(y2/x) = D(y/x) ,the hypothesis is verified in the same manner as in the general case. So let us consider the verification of hypothesis Ill.
where
1941
if!
:J/
The hypothesis is not rejected if they both hold with the given probabili ty f
(22)
where Ck is the content of carbon in s teel, t he value of which was obtained through express-analysis when the melting was over. G is the current s t eel weight, to and tk are the completion process start and stop t i mes, respectively. The rate V is measured also indirectly through the composition and the ext ent of the outgoing gases. Identification of the completion process was carried out in order to obtain the C versus V plots which may help the operator to find the most probable time to cease bloydng to provide the required sort of steel (Ovsepian, Simsarian,1974). 'f he experimental dat~ were used to compute the value c...Jc·vx.p =0/14 by formula (14). The determined confidence interval ~o~~ and t h e verified hypothesis I have sho\vn t he process to be heteroscedastic. In addition to t he essence of chemical processes, heteroscedasticity of C versus V also very likely results from error accumulation when computing C by formula (22). Version 2 of hypothesis 11 was further verified on the correlation interrelationship between C and V. The verification has shown that there is no such relationship under large concentrations of carbon in steel and it is observable only under concentrations of less than 0.25 per cent, which is in good agreement with theory. Since the steel had less concentration, C versus V was plotted beginning with C=0.25 per cent. In that case A'tUZ.f was equal flc~ to 0.487 and hypothesis 11 was there-
N. S. Rajbma n and F . A. Ovsepi a n
1942
fore rejected. Thus the identification of the process under study was carried out over the left-hand branch of the block-diagram discussed. The sample size after the reduction of the number of experiments Vias n=565. Verification of hypothesis 111 has shown, for application purposes, a sufficient accuracy of approximating the curves M(y/x) and D(y7x) by the second and the fourth order polynomials, respectively. Accounting for heteroscedasticity increases the accuracy of the process by an estimated 5 to 7 per cent. At the present time this process is under further study. REFERENCES Astrom, K.J. Introduction to Stochastic Control Theory. Academic Press, New York, 1970. EYHKq, A.]., ~ .A.OBc err RH. K Kcc ne~O B aHKID Hen~HeAHNx 06~eKToB
yrrpaBneHKR. ABToM8 TKKa K TeneMexaHKlca, I;~ 11, IlJ'Ib. Eykhoff, P. System Identification. Parameter and State Estimation. J. Wiley, London, 1974. Himmelblau, D.M. Process Analysis by Statistical Methods. J. Vliley, London, 1970. Isermann, R. Prozessidentification. Springer-Verlag, Berlin, 1974.
OBCeITHH,
~.A., P.A. CKMcap1RH. W~ e HTK ~M KaU K R rr ponecca ~ O BO~KK KK cnopo~Ho-K 0HB ep To p HoA rrnaBKK C j qeTOM rrepeMeHHo A ~KCITepcK K . TDY~N BcecoIDsHoro conemaHKH
1Y
rrb aBTOMaTKqeCKouy YIT~8BneHKID, T.3, KS~-BO "HaYKa", 9'74.
Ovsepian, F.A. and N.A. Vardanian. Identification of he teroscedastic plants. Identification and system parameter estimation, part 2. Preprints of the 7rd IFAC Symposium, The Hague Delft, The Netherlands, 1973.
N 1Y CKMITOSK M ~~AK ITO K eHTKKKa KK K 0 8HKe ITa aMeT OB. 'J S~-BO leUHKepe a, Kn KCK, 1976. DyraqeB, B.C.~ ~.E. Ka3aKoB, il.r. EBn8HoB. UC HOBN CTaTKCTKqeCKOU TeopKK aBT OM 8TKqeCKKX CKCTeM. il s~-Bo "'iJam KHOcT poeHKe ", ivi OC KE a, 1974. PaI1 6MaH, II . C., B. r.l.'1a1-\eeB. DocTpo eHK e l !o ~ e n ePi rr~Ol! e C CO B rr poM3 Bo,Zl;CT Ba. ~,! 3 l\-BO 8H ep r ~1Rii, J.,ocKBa, T
T9'75".
Rajbman, N. S. The applicat i on of i dent i fication me t h ods in t he USSR - a survey. Automatica, vol. 12, Pe rgamon Press,1976.
Pa t16MaH, H. C. KO ppe JIfIl{KOHHbI e MeTO~ bJ orrpe.:n:e nellVlR np K6n K)'{eHHblx xapaKTep Mc TKK aBTOMaT KqeCKKX nKHKU. U3~ . ,AH CCC P ~ OTHi oBHepre TKKa K aBTOLa TVlKa, 1,_ 1. vb!. Sage, A.P. and T.L. melsa. System Identification. Academic Press, New York, 1971.
CapMasoB, O . ~ . C06C TBBHHue KoppenRuVlOH RNe WYHKUKK VI KX rr pKM eH eHWR B TeopkiK C Ta r~K OH ap H brx MapKOBcKKX rrpOL\eCCOB. AAH, T. 132, Ji: 4, 1960. [~ ITKKH, R.3. OCH OBhl TeopK w o6yqa ID*11XCR C'l CT eM. !'l 3,l1-BO "HaYKa ", Vl0CKBa, 19'10. BOPOHOB, A.A. OCROBN TeopKK aBTOMaT~qeCKOrO ynpaBneHKR. " S,lI-BO "8HeprKR", Jle rH!Hrpa~, 197 0.