Signal Processing 46 (1995) 57-74

Unsupervised time series classification*

J.J. Rajan*, P.J.W. Rayner

Cambridge University Engineering Department, Trumpington St., Cambridge CB2 1PZ, UK

Received 14 April 1994; revised 28 November 1994 and 4 May 1995
Abstract

In this paper a scheme for unsupervised probabilistic time series classification is detailed. The technique utilizes autocorrelation terms as discriminatory features and employs the Volterra Connectionist Model (VCM) to transform the multi-dimensional feature information of each training vector to a one-dimensional classification space. This allows the probability density functions (PDFs) of the scalar classification indices to be represented as a function of the classifier weights. The weight values are chosen so as to maximize the separability of the class conditional PDFs. Statistical similarity tests based on the overlap area of the PDFs are then performed to determine the class membership of each training vector. Results are presented that illustrate the performance of the scheme applied to synthetic and real world data.
*This work was supported by the Human Capital and Mobility Program of the Commission of the European Communities under Contract ERBCHBICT930523.
*Corresponding author. Tel.: +44-1223-332767. Fax: +44-1223-332662. E-mail: [email protected].
Keywords: Unsupervised learning; Time series; Classification; Autocorrelation; Volterra series
1. Introduction

Unsupervised classification deals with the problem of forming associations between training vectors when the class membership of the training vectors is unknown or imprecise. Most standard techniques for this problem require specific prior information relating to the attributes of the classes; many also require estimates of the minimum and maximum number of possible classes and the maximum number of training vectors allowed in a given class. Often this information is not available; in many medical or biological classification problems one may be trying to categorize types of compounds or cells where the class membership of all the available patterns is unknown and frequently no information regarding the number of discrete classes present is available.

In this paper a scheme for unsupervised probabilistic time series classification using the Volterra Connectionist Model (VCM) is detailed [9, 10]. Time series classification differs from standard pattern classification in that it is often possible to obtain, from each training vector, estimates of the variances and covariances of the features extracted from that training vector. This relies on the assumption of ergodicity of the time series (i.e., time averages can be used to obtain variance information, which for patterns is generally only possible if an ensemble is available). The technique utilizes autocorrelation terms as discriminatory features and employs the VCM to transform the multi-dimensional feature information of each training vector to a one-dimensional classification space (the linear transformation from the feature space to the classification space, although suboptimal, is chosen so as to be robust to inaccuracies in the estimation of the covariances of the features). Using the covariance information of the features of each training vector and the VCM weights, an
estimated PDF of the classification index of each training vector can be specified. The network weight values are chosen so as to maximize the separability of the PDFs of the known clusters. A statistical test
based on the overlap area of the PDFs is then performed to evaluate the similarity of each training vector with the known clusters. The training vector is then assigned to a particular class or else a new class is formed with that training vector as its center. This procedure is repeated for all training vectors. A validation scheme is then applied which involves moving the training vectors from one cluster to another such that the ‘optimal’ class membership is obtained.
2. Initial clustering of the training set

Consider N_T time series training vectors, each of length N. In this training set there are M classes, the m-th class containing N_m training vectors, where m ∈ {1, ..., M}. Hence N_T = Σ_{m=1}^{M} N_m. It is required to determine the number of classes present, M, and the number of training vectors in each respective class m. The order in which the training vectors are tested for class membership is very important and may alter the final classification states. Initially, it is required to order the training vectors in terms of their statistical similarity. Choose a training vector and label it TV_1. Using a measure of statistical similarity¹ [14] such as the directed divergence D or the
¹Given two PDFs p_1(y) and p_2(y) of a multidimensional random variable y, the Bhattacharyya distance is defined as B = −ln[∫ (p_1(y) p_2(y))^{1/2} dy] and the directed divergence or Kullback-Leibler number is given by D = ∫ [p_1(y) − p_2(y)] ln[p_1(y)/p_2(y)] dy.
Bhattacharyya distance B, determine the training vector 'nearest' to TV_1 and label it TV_2. Then determine the training vector 'nearest' to TV_2 and label it TV_3. This process is continued for all N_T training vectors. The clustering must be implemented in this tedious way rather than by simultaneously ordering the training vectors because neither the Bhattacharyya distance nor the directed divergence is a true distance measure, in that they do not satisfy the triangle inequality (i.e. if J_m(i, j) represents a distance measure for classes i and j based on a feature vector y of dimension m, then the triangle inequality states that J_m(i, k) ≤ J_m(i, j) + J_m(j, k)). Assume that p_1(y) and p_2(y) are multivariate Gaussian densities with mean vectors m_1 and m_2 and class covariance matrices K_1 and K_2, respectively. It can be shown [14] that B and D are given by
B = \frac{1}{8}(m_1 - m_2)^{\mathrm{T}} \left[\frac{K_1 + K_2}{2}\right]^{-1} (m_1 - m_2) + \frac{1}{2} \ln \frac{|\,(K_1 + K_2)/2\,|}{|K_1|^{1/2} |K_2|^{1/2}},   (1)

D = \frac{1}{2} \mathrm{Tr}(K_1^{-1} K_2 + K_2^{-1} K_1 - 2I) + \frac{1}{2}(m_1 - m_2)^{\mathrm{T}} (K_1^{-1} + K_2^{-1})(m_1 - m_2).   (2)
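As a concrete illustration, the sketch below (not part of the original paper; it assumes NumPy is available and the function names are illustrative only) evaluates Eqs. (1) and (2) for two multivariate Gaussian densities specified by their mean vectors and covariance matrices. Either quantity can be used for the nearest-neighbour ordering of the training vectors described above.

```python
import numpy as np

def bhattacharyya(m1, K1, m2, K2):
    """Bhattacharyya distance B of Eq. (1) between two Gaussian densities."""
    dm = m1 - m2
    Kavg = 0.5 * (K1 + K2)
    quad = 0.125 * dm @ np.linalg.solve(Kavg, dm)
    logdet = 0.5 * np.log(np.linalg.det(Kavg) /
                          np.sqrt(np.linalg.det(K1) * np.linalg.det(K2)))
    return quad + logdet

def directed_divergence(m1, K1, m2, K2):
    """Directed divergence D of Eq. (2) (symmetrized Kullback-Leibler form)."""
    dm = m1 - m2
    K1inv, K2inv = np.linalg.inv(K1), np.linalg.inv(K2)
    trace_term = 0.5 * np.trace(K1inv @ K2 + K2inv @ K1 - 2.0 * np.eye(len(m1)))
    quad_term = 0.5 * dm @ (K1inv + K2inv) @ dm
    return trace_term + quad_term

# Example: two 2-D Gaussian feature densities
m1, K1 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]])
m2, K2 = np.array([1.0, 0.5]), np.array([[1.5, 0.0], [0.0, 0.8]])
print(bhattacharyya(m1, K1, m2, K2), directed_divergence(m1, K1, m2, K2))
```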
Usually the difference in clustering or ranking order that results from using the Bhattacharyya distance and directed divergence is extremely slight and the choice between which of the two to use is relatively unimportant. The directed divergence and the Bhattacharyya distance do not enable decisions regarding absolute statistical similarity to be made. It is not possible to specify a threshold value for either D or B that allows decisions to be made as to whether two class conditional PDFs can be considered so similar that their corresponding training vectors belong to the same class. Thus, in order to categorize the remaining training vectors, an alternative statistical similarity measure must be used.
3. Volterra connectionist model

Ideally, one would like to measure statistical similarity by determining the hyper-volume contained in the overlap between adjacent multivariate Gaussian PDFs. As this is not analytically tractable it is necessary to transform the features to another space where it is possible to analytically evaluate some alternative measure of similarity; this can be achieved using a single layer connectionist network, the Volterra Connectionist Model (VCM). The VCM is a linear network that operates on non-linearly extended input pattern vectors. The network implements a general non-linear scalar function of the N-dimensional pattern vector x, but is linear in the weights. Given an input time series vector x_m from class m (where m ∈ {1, 2, ..., M}), we can form an extended vector Φ(x_m) such that the linear discriminator is described by

d_m = \sum_{i} w_i \, \phi_i(x_m) = w^{\mathrm{T}} \Phi(x_m),   (3)
where d_m is the classification index for class m, the φ_i(x_m) are a set of fixed non-linear basis functions of the time series x_m and the w_i are weights which may be varied such that optimal classification is achieved. The extension vector chosen is the Heterogeneous Discrete Volterra Series [1], since Volterra functional analysis is the basis of many non-linear analysis methods of high order spectra and, in the absence of a priori knowledge of the nature of the time series, is a powerful method for representing the structural features of the signal. In general, the extension vector Φ(x) is a truncated series such that if the input vector x has N elements then the (N, K) discriminator can be written as

d = w^{\mathrm{T}} \Phi(x)   (4)
  = w_0 + \sum_{i_1=1}^{N} w_{i_1} x_{i_1} + \sum_{i_1=1}^{N} \sum_{i_2=i_1}^{N} w_{i_1 i_2} x_{i_1} x_{i_2} + \cdots + \underbrace{\sum_{i_1=1}^{N} \cdots \sum_{i_K=i_{K-1}}^{N} w_{i_1 \cdots i_K} x_{i_1} \cdots x_{i_K}}_{K\text{th order}}.   (5)
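For concreteness, a minimal sketch (again not from the paper; names are illustrative) of the second-order (K = 2) extension used in the following subsection is given below.

```python
import numpy as np

def volterra_extend(x):
    """Second-order Volterra extension of an input vector x:
    [1, x_1, ..., x_N, x_i * x_j for j >= i], i.e. the K = 2 case of Eq. (5)."""
    x = np.asarray(x, dtype=float)
    terms = [1.0]                          # constant term (weight w_0)
    terms.extend(x)                        # first-order terms
    for i in range(len(x)):
        for j in range(i, len(x)):
            terms.append(x[i] * x[j])      # second-order terms
    return np.array(terms)

print(volterra_extend([2.0, 3.0]))   # -> [1. 2. 3. 4. 6. 9.], cf. Eq. (7)
```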
3.1. Volterra extended vectors

Given a stationary time series x of length N, the statistics of which are unknown, the task of feature extraction is to determine characteristics and attributes that will be useful for discrimination between classes. Assuming the signal can be described by a polynomial type expansion, the truncated Volterra extended vector of the signal enables a simple projection to a lower dimensional non-linear space. For example, define the state vector of two time series observations at t = i and t = i − 1 as

x^{\mathrm{T}} = [x(i) \;\; x(i-1)].   (6)

Taking the (2, 2) Volterra series expansion² of x, we get the extended (time series) vector

\Psi^{\mathrm{T}}(x) = [1 \;\; x(i) \;\; x(i-1) \;\; x^2(i) \;\; x(i)x(i-1) \;\; x^2(i-1)].   (7)

Let Φ(x) = Ê{Ψ(x)}, where the estimated expectation operator Ê is defined as the time (statistical) average over the signal. Thus, for a zero-mean process, Ê{x(t)} ≈ 0 and the distinct, non-zero terms of Φ(x) are given by

\Phi^{\mathrm{T}}(x) = [1 \;\; \hat\gamma(0) \;\; \hat\gamma(1)],   (8)

where γ̂(k) is the estimated sample autocorrelation at lag k. Hence the autocorrelation function derives naturally from the second order truncated Volterra series expansion of the signal. The biased sample autocorrelation estimator is given by

\hat\gamma(k) = \frac{1}{N} \sum_{i=1}^{N-k} x(i)\, x(i+k).   (9)

The estimator is asymptotically unbiased since the bias term approaches zero as the number of observations N goes to infinity.

The use of autocorrelation functions as discriminatory features will clearly not be optimal for all time series, but in the absence of detailed a priori knowledge regarding the nature of the data this choice is reasonable. One significant advantage in using autocorrelations as features is that in most cases the correlations are ranked in terms of their discriminatory power. If one desires a network with p weights then it is acceptable to choose the first p autocorrelation terms, since these will contain the greatest amount of representative information concerning each class. This can be empirically justified using a feature selection algorithm such as UNISOC [4].

²Note that a second order Volterra expansion vector of length p will only introduce autocorrelation terms up to lag (p − 1).
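The biased estimator of Eq. (9), and the resulting feature vector of the first p autocorrelation lags, can be computed for example as follows (an illustrative sketch, not from the paper).

```python
import numpy as np

def autocorr_features(x, p):
    """Biased sample autocorrelation estimates of Eq. (9) for lags 0..p-1,
    used as the discriminatory feature vector Phi(x) of a zero-mean series."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p)])

# Example: four-lag feature vector of a white-noise series of length 300
rng = np.random.default_rng(0)
phi = autocorr_features(rng.standard_normal(300), p=4)   # [gamma(0)..gamma(3)]
```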
3.2. Statistics of the autocorrelation estimator

In order to determine an analytic expression for the conditional risk the class conditional PDFs must be determined. This in turn requires a knowledge of the marginal PDF for each of the autocorrelation terms used in the feature vector. For reference assume that the time series of each class is zero-mean and Gaussian. The joint distribution of the k(k + 1)/2 distinct autocorrelation terms of the k × k sample autocorrelation matrix C can be shown [16] to be of the form
p(C) = \frac{|M|^{-(N-1)/2} \, |C|^{(N-k-2)/2}}{2^{k(N-1)/2} \, \pi^{k(k-1)/4} \prod_{j=1}^{k} \Gamma\!\left(\tfrac{1}{2}(N-j)\right)} \exp\!\left[-\tfrac{N}{2} \, \mathrm{Tr}(M^{-1} C)\right],   (10)
where M is the matrix of true second order moments, C is the symmetric positive definite matrix of the sample autocorrelation estimates (i.e. C_{ij} = γ̂(|i − j|)) and N is the length of the time series. This distribution is known as the joint Wishart distribution. In general, an analysis of the distribution is extremely difficult, primarily because the region of variation of each of the autocorrelation terms is unknown a priori; for example, the domain of variation of γ̂(0) is 0 to ∞, but it is not easy to specify those for the other γ̂(·) terms, which
are conditioned by the fact that the matrix C must be positive definite. Hence, integrating out particular terms in C to obtain marginal probability distributions for the autocorrelations cannot be performed analytically [6]. Kendall and Stuart [6] show that from many points of view the Wishart distribution can be considered as a chi-squared distribution generalized to k-dimensional variation. They also give examples for specific values of k which indicate that the Wishart distribution approaches normality as N approaches infinity; however, there does not appear to be a rigorous proof of this asymptotic behaviour for the general case. Nevertheless, empirical results indicate that the Wishart distribution is well modelled by a Gaussian distribution when N is large, even when the underlying distribution of the time series is non-Gaussian. Figs. 1 and 2 show histograms of 10000 realizations of the autocorrelation function at lag 0 using time series of length N = 100 and N = 500, respectively, from an ARMA(3,1) process driven by Gaussian noise. Superimposed on each histogram
is the Gaussian PDF approximation which was determined from a training set of only 50 realizations of the appropriate length time series. Figs. 3 and 4 show histograms of 10000 realizations of the classification index of the same ARMA(3,1) process. The classification index was constructed using a weighted sum of the first four autocorrelation terms; these terms being determined from time series of length N = 100 and N = 500, respectively. The weights chosen to construct the classification index are arbitrary. Since the autocorrelation terms are asymptotically Gaussian it is expected that a weighted sum of these terms will also be asymptotically Gaussian. In Figs. 3 and 4 a Gaussian PDF approximation, determined from a training set of 50 realizations, is superimposed on the histograms for comparison. Both sets of figures show that as N increases the histograms become more symmetric and the Gaussian PDF approximation becomes more accurate. It should also be noted that even for values of N > 100 the tail behaviour of the true Wishart distribution is reasonably well represented by the Gaussian PDF approximation
Fig. 1. Histogram of γ̂(0) of an ARMA(3,1) process and its Gaussian approximation for N = 100.
Fig. 2. Histogram of γ̂(0) of an ARMA(3,1) process and its Gaussian approximation for N = 500.
even though the Wishart distribution is clearly skewed. Hence, the Gaussian PDF approximation is a worthwhile simplification given its relative accuracy to the true PDF for large N (i.e. in most practical classification problems N > 200 so that the variances of parameter or statistical estimates are not excessively large) and also because of the resulting huge reduction in mathematical complexity that is achieved. One possible justification of this asymptotic behaviour is that the Central Limit Theorem, under weak conditions, can be applied to the autocorrelation function, which is simply a random variable formed from the sum of a large number of squared correlated variates.3 Given this assumption it is clear that the classification index being a weighted sum of Gaussian variables will also be a Gaussian;
³It should be noted that one of the necessary conditions for application of the Central Limit Theorem is that the random variates have finite variance. Data drawn from certain distributions such as Cauchy distributions do not satisfy this condition.
this is verified empirically in Figs. 3 and 4. Hence, assuming that the PDF of the classification index is Gaussian, only the first two moments (i.e. the mean and variance) need be evaluated to fully specify the class conditional PDF. Let Φ(x_i) be the vector of autocorrelation terms⁴ at lags zero and one of the i-th training vector of class m; thus the mean of the classification index, d̄_m, can be easily determined from the N_m training vectors that define class m. Hence,
\bar{d}_m = \frac{1}{N_m} \sum_{i=1}^{N_m} w^{\mathrm{T}} \Phi(x_i).   (11)
⁴The constant term in the Volterra series extended vector is redundant in this formulation as its contribution towards the variance of the classification index is zero.
Fig. 3. Histogram of the classification index of an ARMA(3,1) process and its Gaussian approximation for N = 100.
Using the example of a state vector of length 2 given in Eq. (6), the variance of the estimated classification index, V(d̂_m), can be written as

V(\hat{d}_m) = w_0^2 V(\hat\gamma_m(0)) + 2 w_0 w_1 \mathrm{Cov}(\hat\gamma_m(0), \hat\gamma_m(1)) + w_1^2 V(\hat\gamma_m(1)) = \sum_{r=0}^{1} \sum_{n=0}^{1} w_r w_n \, \mathrm{Cov}(\hat\gamma_m(r), \hat\gamma_m(n)).   (12)

The covariance of the autocorrelation function is defined as

\mathrm{Cov}(\hat\gamma_m(r), \hat\gamma_m(n)) = E\{[\hat\gamma_m(r) - E\{\hat\gamma_m(r)\}]\,[\hat\gamma_m(n) - E\{\hat\gamma_m(n)\}]\}.   (13)

An analytic expression for Cov(γ̂_m(r), γ̂_m(n)) was developed by Bartlett [2]:

\mathrm{Cov}(\hat\gamma_m(r), \hat\gamma_m(n)) = \frac{1}{N^2} \sum_{k=-(N-r-1)}^{N-n-1} (N - n - g(k)) \, [\gamma_m(|k|)\,\gamma_m(|k+n-r|) + \gamma_m(|k+n|)\,\gamma_m(|k-r|)],   (14)

where r ≤ n and

g(k) = \begin{cases} k, & k > 0, \\ 0, & r-n \le k \le 0, \\ r-n-k, & k < r-n, \end{cases}   (15)

and γ̂_{m,i}(n) refers to the estimated autocorrelation at lag n of the i-th time series of class m (there are N_m training vectors in class m), with γ̄_m(n) = (1/N_m) Σ_{i=1}^{N_m} γ̂_{m,i}(n) the class-averaged autocorrelation estimate at lag n. Hence, the class-conditional probability density function for the classification index d can be written as

p(d \,|\, m) = \frac{1}{\sqrt{2\pi V(\hat{d}_m)}} \exp\!\left[-\frac{(d - \bar{d}_m)^2}{2 V(\hat{d}_m)}\right],   (16)

where

\bar{d}_m = w_0 \bar\gamma_m(0) + w_1 \bar\gamma_m(1).   (17)
Fig. 4. Histogram of the classification index of an ARMA(3,1) process and its Gaussian approximation for N = 500.
The variance in Eq. (16) is evaluated from Eq. (14) with the autocorrelations replaced by their class-averaged estimates:

V(\hat{d}_m) = \sum_{r=0}^{1} \sum_{n=0}^{1} \frac{w_r w_n}{N^2} \sum_{k=-(N-r-1)}^{N-n-1} (N - n - g(k)) \, [\bar\gamma_m(|k|)\,\bar\gamma_m(|k+n-r|) + \bar\gamma_m(|k+n|)\,\bar\gamma_m(|k-r|)].   (18)
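The following sketch (ours, not from the paper) evaluates the Bartlett covariance of Eq. (14) and the classification-index variance of Eqs. (12)/(18); note that the piecewise factor g(k) follows our reading of the garbled Eq. (15) and should be checked against the original.

```python
import numpy as np

def g(k, r, n):
    """Piecewise factor of Eq. (15) (our reading of the original), with r <= n."""
    if k > 0:
        return k
    if k >= r - n:                 # r - n <= k <= 0
        return 0
    return r - n - k               # k < r - n

def bartlett_cov(gamma, r, n, N):
    """Bartlett covariance Cov(gamma_hat(r), gamma_hat(n)) of Eq. (14);
    `gamma` is an array of (class-averaged) autocorrelations for lags 0..N-1."""
    if r > n:                      # the covariance is symmetric; enforce r <= n
        r, n = n, r
    total = 0.0
    for k in range(-(N - r - 1), N - n):          # k = -(N-r-1), ..., N-n-1
        total += (N - n - g(k, r, n)) * (
            gamma[abs(k)] * gamma[abs(k + n - r)] +
            gamma[abs(k + n)] * gamma[abs(k - r)])
    return total / N ** 2

def index_variance(w, gamma, N):
    """Variance of the classification index, Eqs. (12) and (18)."""
    p = len(w)
    return sum(w[r] * w[n] * bartlett_cov(gamma, r, n, N)
               for r in range(p) for n in range(p))
```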
The conditional risk is the overall expected loss that is incurred using a specific decision rule. If each class is equiprobable and a zero-one cost function [15] is used then the overall expected loss is simply the average overlap area between the class-conditional PDFs; hence the conditional risk R_c can be written as
R_c = \frac{1}{M} \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \ne i}}^{M} \left\{ \int_{e_{i,j}}^{\infty} p(d \,|\, i) \, \mathrm{d}d + \int_{-\infty}^{e_{i,j}} p(d \,|\, j) \, \mathrm{d}d \right\},   (19)
where e_{i,j} is the intersection of p(d | i) and p(d | j) that lies between d̄_i and d̄_j. The VCM classification scheme developed here attempts to approximate the optimal Bayes discriminant boundaries by specifying a weight vector that transforms the features to a one-dimensional classification space where the conditional risk can be specified analytically. Varying the weights alters the partitioning of the feature space and the overlaps of the class conditional probability density functions in the classification space. For a given problem the conditional risk is a function of the network weights only; hence, the weight vector w_opt that minimizes the conditional risk is 'optimal', in the sense that it minimizes the total probability of error over the PDFs of the classification indices. The conditional risk R_c as a function of the weight vector w is a highly non-linear, multi-modal function. However, in general, the number of features used and correspondingly the number of weights required is fairly small (less than 10) and so any standard iterative optimization procedure can be employed to determine the optimum weight values which correspond to the minimum value of R_c. Appendix A gives a brief overview of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm which is
a suitable optimization routine for use in this problem.⁵ It should be noted, however, that simply minimizing the conditional risk with respect to w will not necessarily result in the classifier operating at the Bayes rate. The explanation for this requires an examination of the nature of the partitioning of the feature space. A linear weight vector will map the discriminant boundary between two classes in the feature space to the intersection point of the two class conditional PDFs in the classification space via a linear transformation. Thus the use of a linear weight vector will always constrain the geometrical form of the discriminant boundary in the feature space to be a hyperplane. If the PDFs of the features of each class are Gaussian and homoscedastic (that is, the covariance matrices are equal) then the Bayes optimal discriminant boundary will be a hyperplane and the VCM will operate at the Bayes rate. This is illustrated for the two-dimensional case in Fig. 5. If, however, the PDFs are heteroscedastic, then the Bayes discriminant boundary will be hyperquadratic in nature; hence, a linear weight transformation will simply attempt to fit a hyperplane to the Bayes discriminant and the VCM will operate at a misclassification rate greater than the Bayes rate. This is illustrated in Fig. 6. A detailed explanation of the formation of optimal discriminant boundaries can be found in [3]. It is clear from Fig. 6 that with configurations of three or more classes simple hyperplane partitioning of the feature space will be suboptimal and will lead to increased misclassification. In problems where the features of the time series classes are highly heteroscedastic there will be an improvement in classifier performance if a non-linear weight transformation is used. However, in most unsupervised classification problems the covariance matrix of the feature vector must be estimated from each training vector and is only known approximately, and hence the use of a complex non-linear transformation may lead to an overly complex
⁵The BFGS optimization routine is not guaranteed to find the global minimum. Hence the routine must be run several times (usually five to ten times) so that a reasonable local minimum can be found.
Fig. 5. Bayes discriminant boundary is a line for homoscedastic Gaussian PDFs.

Fig. 6. Bayes discriminant boundary is a quadratic for heteroscedastic Gaussian PDFs.
discriminant boundary being formed. The use of a simple partitioning scheme results in a classification scheme that is more robust to noisy training data.

4. Determining training vector class membership

Initially, use the Bhattacharyya distance or the directed divergence to order the training vectors. At
this stage the training vectors are crudely clustered. Assume that there exist at least two classes; place TV_1 in Class 1 and TV_{N_T} in Class 2. Each of these two classes is specified entirely by only one training vector. The class-conditional PDFs for Classes 1 and 2 are defined by the univariate Gaussian PDFs of the classification indices d_1 and d_{N_T}, respectively. The VCM weight vector w can be determined by minimizing the overlap area of these two PDFs. Applying the next training vector TV_2 to the network gives the PDF of the classification index d_2, which specifies the mapping of that feature vector onto the classification space. From this mapping, the degree of similarity between TV_2 and the other training vectors TV_1 and TV_{N_T} can be estimated by comparing the overlap areas of their respective univariate Gaussian PDFs. For example, if the overlap area of the PDFs of d_2 and d_1 exceeds a specific threshold then label TV_2 as belonging to the class defined by TV_1; update the mean feature vector Φ̄(x) and covariance matrix K_Φ that relate to that class-conditional PDF. If this PDF overlap does not satisfy the statistical similarity criterion then form a new class, Class 3, which is defined by TV_2. In either scenario the weight vector is updated so as to ensure the maximum separability between the existing class conditional PDFs.

Consider Class 1 as being defined by the attributes of a single training vector TV_1. Let us imagine that TV_1 is an 'outlier' and that its corresponding classification index d_1 actually lies in the right tail at 1.96 standard deviations from the mean of the true class-conditional PDF that represents Class 1. Assume that the maximum separability of any two training vectors of a given class lies within a 95% confidence interval⁶ centred about the mean of the true class conditional PDF (i.e. 1.96 standard deviations about the mean). Consider the following worst case situation. The next training vector TV_2, after transformation by the weight vector, gives a classification index d_2 which lies in the left tail of
the true class-conditional PDF at the 95% significance level; hence, TV_2 should only be accepted as belonging to Class 1 if d_2 lies within the 3.92σ (where σ refers to the standard deviation of the classification index d_1) interval to the left of d_1. If d_2 lies outside this range then a new class should be formed whose attributes are defined by the training vector TV_2. This scheme implicitly assumes that the standard deviation of the estimated class-conditional PDF equals that of the true class-conditional PDF.

Consider the general situation where there are n training vectors that have been labelled as belonging to the same class and a training vector TV_i is to be evaluated for membership of that class. Assuming a 95% significance level the maximum deviation of the means of the estimated and true class-conditional PDFs is given by
|c - \mu| \le \frac{1.96\,\sigma}{\sqrt{n}},   (20)
where c is the mean of the estimated class-conditional PDF, μ is the mean of the true class-conditional PDF and the standard deviation σ is that evaluated from the n training vectors that define the estimated class-conditional PDF. Thus, the classification index of training vector TV_i must lie within a 1.96σ(1 + 1/√n) interval adjacent to the mean of the estimated PDF if it is to be considered a member of that class. Since the weight vector is determined by minimizing the overlap area of the estimated class-conditional PDFs it would be more informative if the class membership stage were formulated in terms of an overlap area criterion. Assume as before that the training vector TV_i is to be evaluated for membership of a class that already consists of n training vectors. Hence the confidence interval 1.96σ(1 + 1/√n) can be expressed as the overlap area A(n) given by
A(n) = 2\left[1 - \Phi\!\left(0.98\left(1 + \frac{1}{\sqrt{n}}\right)\right)\right],   (21)

where Φ(·) is the standard normal cumulative distribution function.

⁶The conventional significance level of 5% is assumed initially. It will be shown in Section 4.2 that this significance level may require modification during the clustering or validation stages, and that this initial choice does not affect the generality of the approach.
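Interpreting A(n) as the overlap area of two equal-variance Gaussian PDFs whose means are 1.96σ(1 + 1/√n) apart (our reading of the garbled Eq. (21)), the threshold can be evaluated directly; the short sketch below reproduces the percentages quoted in the next paragraph.

```python
import math

def overlap_threshold(n):
    """Required overlap area A(n) of Eq. (21): overlap of two equal-variance
    Gaussians whose means are 1.96*sigma*(1 + 1/sqrt(n)) apart."""
    z = 0.98 * (1.0 + 1.0 / math.sqrt(n))     # half the separation, in sigmas
    return math.erfc(z / math.sqrt(2.0))      # = 2 * (1 - Phi(z))

for n in (1, 2, 3):
    print(n, round(100 * overlap_threshold(n), 2))   # 5.0, 9.43, 12.22 %
# As n -> infinity the threshold tends to erfc(0.98 / sqrt(2)) = 32.7 %
```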
Thus if the estimated class-conditional PDF is described by only one training vector (i.e., n = 1) then the overlap area must be at least 5% for class membership to be confirmed (equivalent to a 3.92 standard deviation confidence interval adjacent to the mean of the estimated class-conditional PDF); if the PDF is defined by two training vectors then the overlap must be 9.43%; with three training vectors it must be at least 12.21%; asymptotically as n approaches infinity (i.e., the mean of the estimated class-conditional PDF approaches the mean of the true class-conditional PDF) the overlap must be at least 32.71% (equivalent to a 1.96 standard deviation confidence interval). Hence, if there are few data a very small overlap area may be significant, whereas if there are many data only a large overlap area will be significant. Fig. 7 illustrates the equivalence of the confidence interval and overlap area approaches. The overlap area illustrated in Fig. 7 is referred to as the equal variance overlap since the formulation of the scheme explicitly assumes that the variance of the classification index of the training vector under evaluation equals that of the estimated class-conditional PDF. This assumption will clearly be acceptable only when the training vector under evaluation belongs to the same class as the adjacent estimated class-conditional PDF. Consider the situation in Fig. 8 where the training
Fig. 7. Overlap area of estimated class-conditional PDFs.

Fig. 8. Actual overlap area of estimated class-conditional PDF with that of the training vector under evaluation.
vector under evaluation belongs to a different class and its classification index variance is considerably smaller than that of the adjacent estimated class-conditional PDF. In this situation the training vector lies within the 1.96σ(1 + 1/√n) confidence interval and if the overlap area were determined using a variance equal to that of the estimated class-conditional PDF then the required threshold would be exceeded and the training vector would be labelled as belonging to that class. However, it seems reasonable to assume that training vectors that belong to the same class, as well as having similar autocorrelation terms, should also have similar autocorrelation variances. The variance of the classification index of the training vector under evaluation can be estimated; hence, this information should be used in order to prevent training vectors with considerably different autocorrelation variances being incorporated into the same class. Thus an additional requirement should be specified: that the actual overlap area between the PDF of the classification index of the training vector under evaluation and the estimated class-conditional PDF must also exceed the required overlap area threshold. If the training vector does actually belong to the same class as the estimated class-conditional PDF then the classification index variances will be similar and the actual
Fig. 9. Actual overlap area of estimated class-conditional PDF with that of the training vector under evaluation.
overlap area and the equal variance overlap area will be very similar and both overlap areas should exceed the required threshold. If the classification index variance of the training vector to be evaluated is larger than that of the estimated class-conditional PDF then the actual overlap area may exceed the threshold area requirement as shown in Fig. 9. However, if the training vector classification index lies outside the 1.96σ(1 + 1/√n) confidence interval then the equal variance overlap area estimate will not satisfy the class membership conditions and the training vector will be correctly labelled as belonging to a new, different class. In summary, the covariance information relating to the autocorrelation terms of each training vector is implicitly included in the formulation in that both overlap area estimates must exceed the threshold overlap area requirement. If this requirement is not met then a new class is formed. This procedure is repeated for all the remaining training vectors.
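A sketch (ours, not part of the paper) of the two-sided membership test just described: both the equal-variance overlap and the actual overlap with the estimated class-conditional PDF must exceed the threshold A(n); the overlap of two univariate Gaussians is evaluated numerically.

```python
import math
import numpy as np

def overlap_threshold(n):                  # Eq. (21), as in the earlier sketch
    z = 0.98 * (1.0 + 1.0 / math.sqrt(n))
    return math.erfc(z / math.sqrt(2.0))

def gaussian_pdf(d, mean, std):
    return np.exp(-0.5 * ((d - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def overlap_area(m1, s1, m2, s2, num=20001):
    """Numerical overlap area of two univariate Gaussian PDFs (trapezoidal rule)."""
    lo = min(m1 - 8 * s1, m2 - 8 * s2)
    hi = max(m1 + 8 * s1, m2 + 8 * s2)
    d = np.linspace(lo, hi, num)
    f = np.minimum(gaussian_pdf(d, m1, s1), gaussian_pdf(d, m2, s2))
    return float(np.sum((f[1:] + f[:-1]) * np.diff(d)) / 2)

def accept_membership(d_tv, s_tv, d_class, s_class, n):
    """Accept the training vector into a class of n members only if both
    the equal-variance and the actual overlap criteria hold."""
    threshold = overlap_threshold(n)
    equal_var = overlap_area(d_tv, s_class, d_class, s_class)
    actual = overlap_area(d_tv, s_tv, d_class, s_class)
    return equal_var >= threshold and actual >= threshold
```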
4.1. Specification of threshold value

The classes that have been formed are dependent on the initial threshold assumed (i.e. the 95% confidence interval). It should be noted that this threshold must be set so that it exceeds the total overlap of any two of the class-conditional PDFs. If this threshold is exceeded by the class-conditional PDFs during the class formation stage then the procedure must be restarted with a lower threshold value. A very large initial threshold value ensures that outliers in the classes will be correctly classified; however, it will also tend to penalize class formation which could result in poor identification of underlying class structure. A very small threshold value, on the other hand, will tend to encourage class formation and misclassify outlying training vectors. Ideally, specification of the threshold value should be based in terms of the maximum separability of training vectors that can be considered as belonging to the same underlying class; in practice, of course, no objective maximum separability criterion can be specified since the variability of the training vectors in each class will often not be the same (i.e., the training vectors of one class may all lie within one standard deviation of the mean, whereas those of another class may lie within several standard deviations of the mean). In general, a threshold value that specifies a 95% confidence interval is usually a reasonable first choice that requires alteration only if it is exceeded by the overlap of any two of the class-conditional PDFs, or if it underestimates the maximum separation of any two training vectors in any class. If this situation does occur then the clustering should be restarted with a different threshold value. Hence, although the threshold value must be set initially to a somewhat arbitrary value, it can be validated and, if need be, changed at a later stage to ensure that the clustering and partitioning of the feature space is reasonable.
4.2. Validation of class membership

At this stage M classes have been formed where each defining class-conditional PDF is specified entirely by the training vectors which have been assigned to that class. The separability of the classes is specified by the VCM weight vector. However, these class assignments of the training
vectors have been based on knowledge gleaned from only a subset of the total number of training vectors. Hence, a necessary final stage of the classification scheme is some form of validation of the class membership of the training vectors. This can be achieved by individually removing the ‘effect’ of each training vector from its appropriate class-conditional PDF, recalculating the VCM weight vector and re-applying the training vector to the network. This application of the training vectors to the ‘trained’ network attempts to ensure that any of the training vectors that were initially labelled incorrectly can subsequently be categorized correctly. As before, any change that is made to the class membership of the training vectors necessitates a recalculation of the VCM weight vector.
5. Summary of algorithm

1. Initially order the N_T training vectors using the extracted autocorrelation feature vector and the directed divergence or Bhattacharyya distance measure.
2. Assign training vector TV_1 to Class 1 and TV_{N_T} to Class 2.
3. Apply the VCM, which specifies Gaussian class-conditional PDFs in terms of the means and variances of the classification indices of TV_1 and TV_{N_T}. Determine the VCM weight vector by minimizing the overlap area of these class-conditional PDFs.
4. Choose a confidence interval (typically 95%). Assign the remaining training vectors to one of the existing classes or form a new class using the equal variance and actual variance overlap area criteria. Update the weight vector, and the mean and variance of the appropriate classification index, after each change made to the class membership.
5. If the overlap area of any of the class-conditional PDFs exceeds the chosen confidence interval then restart the algorithm at step 4 with a different confidence interval.
6. Re-apply the training vectors to the 'trained' VCM network and reassign the training vectors appropriately such that the overall number of misclassifications is minimized.
6. Results

In this section the results of two numerical experiments are given: one experiment is on synthetic autoregressive (AR) data; the other experiment involves real data signals, in particular the vibration signals of a drill bit in an oil well. In the application of the unsupervised VCM classification scheme no initial assumptions were made as to the number of clusters present. For each problem a comparison with the standard K-means clustering algorithm [3] is given. The K-means clustering algorithm is one of the most popular clustering algorithms; it is simple to implement, fast and has the ability to produce good results. The general K-means clustering algorithm begins with some arbitrary choice of K training vectors as cluster centres {m_i: i = 1, 2, ..., K}. Each training vector Φ(x_i) is assigned to the nearest (in the Euclidean sense) cluster centre. The sample mean of the clusters is then updated appropriately. After the clusters have been formed each training vector is then reassigned to the cluster with the nearest mean. This procedure is repeated until there is no change in the cluster membership. In the application of the K-means clustering algorithm described in the experiments below, it was assumed that the correct number of classes was known a priori, and the centres of the clusters were set to be the training vectors 'closest' to the true means of the classes.
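For reference, a minimal NumPy sketch (ours) of the K-means baseline as just described, with the initial centres supplied by the caller (in the experiments these were the training vectors closest to the true class means).

```python
import numpy as np

def kmeans(X, centres, max_iter=100):
    """Assign each feature vector to the nearest centre (Euclidean distance),
    then update each centre to its cluster mean; repeat until unchanged."""
    centres = np.array(centres, dtype=float)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(len(centres))])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```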
6.1. Classification of AR(2) data

Table 1 lists the pole positions of eight AR(2) time series driven by Gaussian noise of variance 1. Fig. 10 shows the pole positions of the eight AR(2) time series. In this problem a network with four weights was used to cluster a training set of 80 time series vectors (10 training vectors for each class; each vector being of length 300 observations). The first four autocorrelation terms were chosen as features (i.e. Φ(x) = [γ̂(0) γ̂(1) γ̂(2) γ̂(3)]). The class membership of the training vectors is given in Table 2. The training vectors were initially clustered using the directed divergence measure; no assumptions
Table 1
Pole positions of the eight AR(2) processes

Class   Poles
1       0.10 ± 0.7000i
2       0.26 ± 0.6576i
3       0.42 ± 0.5689i
4       0.58 ± 0.4045i
5       −0.10 ± 0.7000i
6       −0.26 ± 0.6576i
7       −0.42 ± 0.5689i
8       −0.58 ± 0.4045i

Table 2
True class membership of training vectors for the AR(2) data

Class   Training vectors
1       1-10
2       11-20
3       21-30
4       31-40
5       41-50
6       51-60
7       61-70
8       71-80
Fig. 10. Plot of the pole positions of eight AR(2) time series.
were made regarding the number of classes present. Using a threshold value of z_T = 1.96, which effected a 95% confidence interval about the mean between training vectors of the same class, eight distinct clusters were located and the training vectors were subsequently assigned according to Table 3. Table 3 shows that the correct number of classes has been identified. However, three of the training vectors belonging to Class 2 are wrongly given membership of Class 1. The training vectors and the VCM weight vector of the four tap network define the eight class-conditional PDFs. On applying the training vectors to this network the three wrongly classified
Table 3
Class membership of training vectors after clustering using the VCM for the AR(2) data

Class   Training vectors
1       1-10, 13, 14, 20
2       11, 12, 15-19
3       21-30
4       31-40
5       41-50
6       51-60
7       61-70
8       71-80
training vectors in Class 1 were correctly placed in Class 2. The weight vector was then recalculated and the class-conditional PDFs redefined; reapplying the training vectors to the updated network resulted in no misclassifications. The final clustering of the training vectors is given in Table 4. The estimated probability of error (i.e. the average overlap area of the class-conditional PDFs) is 0.47%. Fig. 11 shows the class-conditional PDFs of the eight AR(2) processes. The K-means algorithm (with K = 8) was then applied, with the centre of each class initially set to the training vector 'closest' (in the Euclidean sense) to the true centre of the cluster. The training vectors were clustered as shown in Table 5. The resultant class membership of the training vectors for Classes 1-4 was correct but that for the other classes was not. In particular the vectors clustered to form Classes 5-7 contained mixtures of classes, while Class 8 contained only a subset of its training vector membership.
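Synthetic series of the kind listed in Table 1 can be generated along the following lines (a sketch; the burn-in length and function name are our choices): an AR(2) process with complex-conjugate poles p and p* has coefficients a1 = 2 Re(p) and a2 = −|p|².

```python
import numpy as np

def ar2_from_pole(pole, N=300, burn=500, rng=None):
    """Generate an AR(2) series x_t = a1*x_{t-1} + a2*x_{t-2} + e_t with
    complex-conjugate poles (pole, conj(pole)) and unit-variance Gaussian noise."""
    rng = rng or np.random.default_rng()
    a1, a2 = 2.0 * pole.real, -abs(pole) ** 2
    e = rng.standard_normal(N + burn)
    x = np.zeros(N + burn)
    for t in range(2, N + burn):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + e[t]
    return x[burn:]                      # discard the start-up transient

x_class1 = ar2_from_pole(0.10 + 0.70j)   # e.g. one class-1 series from Table 1
```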
Fig. 11. Class-conditional PDFs of eight AR(2) time series.

Table 4
Class membership of training vectors after clustering and validation using the VCM for the AR(2) data

Class   Training vectors
1       1-10
2       11-20
3       21-30
4       31-40
5       41-50
6       51-60
7       61-70
8       71-80

Table 5
Class membership of training vectors using K-means clustering for eight class AR(2) data

Class   Training vectors
1       1-10
2       11-20
3       21-30
4       31-40
5       41-51, 54, 55, 57, 59
6       52, 53, 56, 58, 60-62, 65-68
7       63, 64, 69-71, 74
8       72, 73, 75-80

6.2. Classification of rock data

Table 6 shows the class membership of 80 time series training vectors of four rock types (shale, marble, sandstone and slate) using vibration signals obtained close to the drill bit in an oil well. In this problem the directed divergence measure was used to order the 80 time series training vectors (20 training vectors for each class; each vector being of length 100 observations). The feature vector used was Φ(x) = [γ̂(0) γ̂(1) γ̂(2) γ̂(3)]; as before, an initial class membership threshold value of z_T = 1.96 was assumed. A four weight VCM was used to effect the feature to classification space transformation. Applying the algorithm, the training vectors were correctly clustered into four distinct classes with each of the training vectors being given correct class membership as shown in Table 7. The final class membership after applying the validation stage of the
Table 6
True class membership of rock type training vectors

Class           Training vectors
1: Shale        1-20
2: Sandstone    21-40
3: Slate        41-60
4: Marble       61-80

Table 7
Class membership of rock type training vectors after clustering using the VCM

Class   Training vectors
1       1-20
2       21-40
3       41-60
4       61-80

Table 8
Class membership of training vectors using K-means clustering for four classes of rock data

Class   Training vectors
1       1-20, 63, 69, 75
2       21-40
3       41-60, 64, 68, 70, 74, 76, 80
4       61, 62, 65-67, 71-73, 77-79
algorithm remained the same as that given in Table 7. The K-means algorithm (where K = 4) was applied to the training vectors with the centre of each class initially set to the training vector ‘closest’ (in the Euclidean sense) to the true centre of the clusters. The training vectors were clustered as shown in Table 8. Clearly, the clustering is incorrect; only the training vectors given membership of Class 2 corresponded correctly with the class ‘sandstone’. Class 1 contains all the vectors of class ‘shale’ along with three vectors from class ‘marble’. Class 3 contains all the vectors from class ‘slate’ with six vectors from class ‘marble’. Class 4 consists of a subset of the vectors belonging to class ‘marble’.
7. Conclusion

A given time series has, in general, a correlation structure which often distinguishes it from other time series, and hence autocorrelation terms can be useful as discriminatory features where little a priori information is available. Assuming ergodicity it is possible to obtain covariance information relating to the autocorrelation features of each time series training vector; the incorporation of this feature covariance information is the main advantage of this classification scheme over standard techniques. It should be noted, however, that, as with all unsupervised learning schemes, the chosen features may lead to 'wrong' results if they do not contain sufficient information to enable proper discrimination of the training vectors.

Each autocorrelation feature is weighted using a connectionist network which enables a univariate Gaussian class-conditional PDF of the classification index of each training vector to be specified. This allows us to compare the statistical similarity of the training vectors relative to each other and form distinct clusters in the classification space. The use of a threshold value specifies explicitly the degree of statistical similarity required in order to achieve good classification using the given features. This threshold value, although initially arbitrarily set, is validated and may be altered at a later stage to ensure effective categorization of the training vectors.

Results on synthetic and real data problems demonstrate that the scheme performs better than the standard K-means clustering algorithm even when the number of clusters and the cluster centres are known. The VCM classification scheme is, however, considerably more computationally intensive.

The problem could also be cast in a fully Bayesian framework by considering the PDFs of the autocorrelations of the entire training set as forming a multi-dimensional Gaussian mixture PDF. An identification procedure could then be carried out to determine the number of distinct peaks or clusters in the feature space that are actually present (this bears some similarity to the classification approach adopted using Probabilistic Neural Networks [7, 11-13]). This in many ways is a more elegant approach but suffers the
major drawback of having to assign prior probabilities for the existence of each class. These prior probabilities would act so as to penalize or encourage the formation of new classes; in general, the assignment of these prior probabilities would be a highly non-trivial problem. In addition, dealing with the multidimensional features explicitly has the disadvantage of requiring numerically intensive analysis methods. It is generally easier to deal with the scalar output (classification indices) of the VCM than the feature vectors themselves. The weights of the VCM attempt to maximize the within-class similarity and the between-class dissimilarity of the clusters, which tends to result in better discrimination of the training vector classes. Another advantage is that in the classification space it is relatively easy to assign an initial threshold (specifying a confidence interval) for the class-conditional PDFs based on conceptually well understood ideas. There is also the advantage that the method results in a fully trained network with a weight vector completely determined, rather than just a description of the clustering behaviour of the training vectors in the feature space.
Acknowledgements

The authors would like to thank the anonymous reviewers for their useful comments and suggestions which helped improve the readability and quality of this paper.

Appendix A. Broyden-Fletcher-Goldfarb-Shanno optimization algorithm

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is generally recognized as empirically being one of the 'best' of the variable metric or quasi-Newton type of optimization scheme. In common with variable metric and conjugate gradient methods we assume that our function f(x) can be locally approximated about our current point x_i by a quadratic form (this is usually its truncated Taylor series expansion); thus,

f(x) \approx c - b^{\mathrm{T}} x + \tfrac{1}{2} x^{\mathrm{T}} A x.   (22)

The gradient vector b and the Hessian matrix A are respectively defined as

b = -\nabla f \,\big|_{x = x_i},   (23)

A = \left[\frac{\partial^2 f}{\partial x_j \, \partial x_k}\right]_{x = x_i}.   (24)

Differentiating Eq. (22) with respect to x we obtain

\nabla f(x) = -b + A x.   (25)

At the minimum point x_m the gradient ∇f(x_m) is zero; thus if A is known exactly then

A x_m = b   (26)

and x_m can be obtained in a single jump. The variable metric method attempts to iteratively build up an approximation to the inverse Hessian matrix A^{-1} by constructing a sequence of matrices H_i with the property lim_{i→∞} H_i = A^{-1}. The H_i are determined such that they are always symmetric and positive definite, which guarantees that we always move in a downhill direction. At the current point x_i we can write

A x_i = \nabla f(x_i) + b.   (27)

Subtracting Eq. (26) from Eq. (27) and pre-multiplying by A^{-1} we obtain

x_m - x_i = A^{-1} [-\nabla f(x_i)].   (28)

Subtracting Eq. (28) at x_{i+1} from that at x_i and substituting H_{i+1} for the inverse Hessian A^{-1} gives the new update (by appropriate substitution for x_{i+1})

x_{i+1} = x_i + H_{i+1} [\nabla f(x_{i+1}) - \nabla f(x_i)].   (29)

Invoking a standard line minimization routine to determine x_{i+1}, the inverse Hessian H_{i+1} can be updated from the somewhat formidable expression

H_{i+1} = H_i + \frac{(x_{i+1} - x_i) \otimes (x_{i+1} - x_i)}{(x_{i+1} - x_i)^{\mathrm{T}} (\nabla f_{i+1} - \nabla f_i)} - \frac{[H_i (\nabla f_{i+1} - \nabla f_i)] \otimes [H_i (\nabla f_{i+1} - \nabla f_i)]}{(\nabla f_{i+1} - \nabla f_i)^{\mathrm{T}} H_i (\nabla f_{i+1} - \nabla f_i)} + [(\nabla f_{i+1} - \nabla f_i)^{\mathrm{T}} H_i (\nabla f_{i+1} - \nabla f_i)] \; u \otimes u,   (30)

where

u = \frac{x_{i+1} - x_i}{(x_{i+1} - x_i)^{\mathrm{T}} (\nabla f_{i+1} - \nabla f_i)} - \frac{H_i (\nabla f_{i+1} - \nabla f_i)}{(\nabla f_{i+1} - \nabla f_i)^{\mathrm{T}} H_i (\nabla f_{i+1} - \nabla f_i)}

and ∇f_{i+1} = ∇f(x_{i+1}), ∇f_i = ∇f(x_i), and ⊗ is the Kronecker (outer) product [5]. A detailed proof of the update formulae is given by Polak [8].
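In practice the weight optimization of Section 3 can be carried out with an off-the-shelf BFGS routine; the sketch below (ours, assuming SciPy and NumPy are available) minimizes the average pairwise overlap of the Gaussian classification-index PDFs over the weight vector, restarting from several random initial points as suggested in footnote 5. The per-class feature means and covariances are assumed to have been estimated already.

```python
import numpy as np
from scipy.optimize import minimize

def conditional_risk(w, feat_means, feat_covs, grid_pts=2001):
    """Average pairwise overlap of the Gaussian classification-index PDFs
    (cf. Eq. (19)); feat_means[m] and feat_covs[m] are the feature mean vector
    and feature covariance matrix of class m."""
    means = np.array([w @ fm for fm in feat_means])
    stds = np.array([np.sqrt(max(float(w @ C @ w), 0.0)) for C in feat_covs]) + 1e-12
    lo, hi = (means - 8 * stds).min(), (means + 8 * stds).max()
    d = np.linspace(lo, hi, grid_pts)
    pdfs = [np.exp(-0.5 * ((d - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
            for m, s in zip(means, stds)]
    M, risk = len(means), 0.0
    for i in range(M):
        for j in range(i + 1, M):
            f = np.minimum(pdfs[i], pdfs[j])
            risk += np.sum((f[1:] + f[:-1]) * np.diff(d)) / 2   # overlap area
    return risk / (M * (M - 1) / 2)

def optimise_weights(feat_means, feat_covs, restarts=10, seed=0):
    """BFGS from several random starts (the routine is not guaranteed to find
    the global minimum, cf. footnote 5); returns the best weights found."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        w0 = rng.standard_normal(len(feat_means[0]))
        res = minimize(conditional_risk, w0, args=(feat_means, feat_covs),
                       method='BFGS')
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun
```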
References

[1] P. Alper, "A consideration of the discrete Volterra series", IEEE Trans. Automat. Control, Vol. 10, No. 3, July 1965, pp. 322-327.
[2] M.S. Bartlett, "On the theoretical specification and sampling properties of autocorrelated time series", J. Royal Statist. Soc., Series B, Vol. 8, No. 1, 1946, pp. 28-41.
[3] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[4] A.D.M. Garvin and P.J.W. Rayner, "Probabilistic self-structuring and learning", Proc. CLNL'93, September 1993.
[5] A. Graham, Kronecker Products and Matrix Calculus, Harwood, New York, 1981.
[6] M.G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 3, Griffin, London, 1966.
[7] M.T. Musavi, K. Kalantri, W. Ahmed and K.H. Chan, "A minimum error neural network (MNN)", Neural Networks, Vol. 6, 1993, pp. 397-407.
[8] E. Polak, Computational Methods in Optimization, Academic Press, New York, 1971.
[9] J.J. Rajan and P.J.W. Rayner, "Time series classification using the Volterra Connectionist Model and Bayes decision theory", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. 1, April 1993, pp. 601-604.
[10] J.J. Rajan, Time series classification, Ph.D. Thesis, University of Cambridge, October 1994.
[11] D.F. Specht, "Probabilistic neural networks", Neural Networks, Vol. 3, 1990, pp. 109-118.
[12] D.F. Specht, "Probabilistic neural networks and the polynomial adaline as complementary techniques for classification", IEEE Trans. Neural Networks, Vol. 1, No. 1, March 1990, pp. 111-121.
[13] R.L. Streit and T.E. Luginbuhl, "Maximum likelihood training of probabilistic neural networks", IEEE Trans. Neural Networks, Vol. 5, No. 5, September 1994, pp. 764-783.
[14] C.W. Therrien, Decision, Estimation and Classification, Wiley, New York, 1989.
[15] H.L. Van Trees, Detection, Estimation and Modulation Theory, Wiley, New York, 1968.
[16] J. Wishart, "The generalized product moment distribution in samples from a normal multivariate population", Biometrika, Vol. 20A, 1928, pp. 32-52.