Facial Motion Analysis for Content-based Video Coding

Real-Time Imaging 6, 3-16 (2000) doi:10.1006/rtim.1998.0152, available online at http://www.idealibrary.com

Automatic wire-frame fitting and automatic wire-frame tracking are the two most important and most difficult issues associated with semantic-based moving image coding. A novel approach to high speed tracking of important facial features is presented as a part of a complete fitting-tracking system. The method allows real-time processing of head-and-shoulders sequences using software tools only. The algorithm is based on eigenvalue decomposition of the sub-images extracted from subsequent frames of the video sequence. Each important facial feature (the left eye, the right eye, the nose and the lips) is tracked separately using the same method. The algorithm was tested on widely used head-and-shoulders video sequences containing the speaker's head pan, rotation and zoom with remarkably good results. These experiments prove that it is possible to maintain tracking even when the facial features are partially occluded. © 2000 Academic Press

P.M. Antoszczyszyn, J.M. Hannah and P.M. Grant, Department of Electrical Engineering, The University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh EH9 3JL, U.K.

Introduction

Extremely low bit-rate moving image transmission systems are designed to handle video communication in environments where data rates are not allowed to exceed 10 kbit/s. These include PSTN lines in certain countries and mobile communication. It is widely acknowledged that currently available systems are not capable of delivering a satisfactory quality of moving image at such low data-rates. Most of these methods are derived from a block-based motion compensated discrete cosine transform [1]. At extremely low bit-rates (below 9600 bit/s), application of block-based algorithms results in disturbing artefacts [2]. Also, the frame rate is often reduced to about 10 frames/s at QCIF resolution. Despite the introduction of other moving image coding techniques based on vector quantization [3], fractal theory [4] and wavelet analysis [5], it is still not possible to send video over extremely low bit-rate channels with acceptable quality. A promising approach to scene analysis was proposed by Musmann et al. [6,7]. However, it seems that only the application of semantic-based techniques will allow transmission at extremely low bit rates. Aizawa et al. [8] and Forchheimer [9] independently introduced semantic wire-frame models and computer graphics to the area of extremely low bit-rate communication. According to their assessments, it is possible to obtain data-rates below 10 kbit/s for head-and-shoulders video sequences. The concept of semantic-based communication can be briefly explained in the following way. A semantic model of the human face (e.g. Candide, Figure 1) is shared by the transmitter and the receiver. With each subsequent frame of the video sequence the positions of the vertices of the wire-frame are automatically tracked. The initial and subsequent positions of the wire-frame are transmitted in the form of 3D co-ordinates over the low bit-rate channel, along with the texture of the face from the initial frame of the sequence. Knowing the texture of the scene from the initial frame and the 3D positions of the vertices of the wire-frame in subsequent frames, it is possible to reconstruct the entire sequence by mapping the texture of the initial frame of the sequence at the locations indicated by the transmitted vertices. Our efforts are focused on techniques utilizing semantic models of the head-and-shoulders scene using the Candide [9] semantic wire-frame (Figure 1). The two main problems in content-based moving image coding are automatic fitting of the wire-frame and automatic tracking of the scene. A limited number of approaches to automatic wire-frame fitting have been proposed. The method proposed by Welsh [10] utilizes the idea of 'snakes' (active contours). Different approaches were presented by Reinders et al. [11] and Seferidis [12]. Our contributions [13,14] proposed the application of a facial data-base. However, the issue of automatic tracking still remains unresolved. Existing proposals employ optical flow analysis [15] and correlation [16]. A major development of an earlier method [17], proposed as a part of a complete automatic tracking-fitting system [18], is presented here. As the model-based concept is computationally intensive, especially on the transmitter side, effort has been focused on creating an algorithm capable of tracking the sequence in real-time. This has been achieved by reducing the dimensionality of the analysed vector space and the application of principal component analysis in conjunction with singular value decomposition. As will be shown in the following sections, the algorithm has been tested on widely used video sequences with remarkable results (results available from the Internet site http://www.ee.ed.ac.uk/plma/).

Figure 1. Candide wire-frame model.

Eigenvalue Decomposition for Sequence of Images

Principal component analysis is also referred to as the Hotelling [19] transform, or a discrete case of the Karhunen-Loève [20,21] transform. The method of principal components is a data-analytic technique transforming a group of correlated variables such that certain optimal conditions are met. The most important of these conditions is that the transformed variables are uncorrelated. It will be shown that the method of principal components is a reliable mathematical tool for tracking the motion of facial features in head-and-shoulders video sequences. In the first step of the method of principal components, the eigenvectors of the covariance (dispersion) matrix S of the sequence X of M N-dimensional input column vectors,

X = [x_1 x_2 ... x_M],   x_j = [x_ji],   i = 1..N,   j = 1..M,

must be found. In this case the input sequence consists of images (2D matrices). They have to be converted into one-dimensional (1D) column vectors. This can be done by scanning the image line by line or column by column. An image consisting of R rows and C columns would therefore produce a column input vector consisting of N = C x R rows. The covariance matrix can be obtained from the following relationship (T denotes transposition):

S = Y Y^T                                             (1)

where the columns of matrix Y are vectors y_j that differ from the input vectors x_j by the expected value m_x of the sequence X:

Y = [x_1 - m_x   x_2 - m_x   ...   x_M - m_x]         (2)

m_x = (1/M) sum_{i=1}^{M} x_i                         (3)

If Eqn (1) is developed using Eqns (2) and (3), the following symmetric non-singular matrix will be obtained:

    | s_1^2   s_12    ...   s_1N  |
S = | s_12    s_2^2   ...   s_2N  |                   (4)
    | ...     ...     ...   ...   |
    | s_1N    s_2N    ...   s_N^2 |

where s_i^2 is the variance of the ith variable and s_ij, i != j, is the covariance between the ith and jth variable. The images in the analysed sequence will be correlated,

therefore the elements of the matrix S that are not on the leading diagonal will be non-zero. The objective of the method of principal components is to find the alternative co-ordinate system Z for the input sequence X in which all the elements off the leading diagonal of the covariance matrix S_Z (where the index Z denotes the new co-ordinate system) are zeros. According to matrix algebra, such a matrix can be constructed if the eigenvectors of the covariance matrix are known:

S_Z = U^T S U                                         (5)

where U = [u_1 u_2 ... u_N] and u_i, i = 1..N, is the ith eigenvector of the covariance matrix S. The eigenvectors can be calculated from the following relationship:

S u_i = lambda_i u_i                                  (6)

where lambda_i is the eigenvalue of the ith eigenvector, i = 1..N. As can be seen from Eqn (6), calculation of the eigenvectors involves operations on the covariance matrix S. Even if small images are used as an input sequence, the size of the covariance matrix can be too large to handle with common computing equipment (e.g. a sequence of images consisting of 100 columns and 100 rows would result in a 100^2 x 100^2 covariance matrix). This would effectively prohibit real-time implementation. If, however, the number of images M in the sequence X is considerably smaller than the dimension of the images themselves (N = C x R), it is possible to reduce the computational effort considerably by application of singular value decomposition (SVD) [22]. SVD allows the eigenvectors of the matrix S = Y Y^T to be obtained from those of the much smaller matrix C = Y^T Y. Since matrix C is M x M, the computational cost of finding the eigenvectors of the matrix S is greatly reduced (in this application M <= 20). The use of the SVD technique is a key feature in achieving real-time performance from this algorithm. The Hotelling transformation can be expressed by the following relationship:

z_i = u_i^T (x_i - m_x)                               (7)

or in vector form:

Z = U^T Y                                             (8)

where z_i is the ith principal component. If both sides of Eqn (8) are pre-multiplied by matrix U and the orthonormality of matrix U is taken into account (thus U^T = U^-1), the reverse transformation is obtained:

Y = U Z                                               (9)

The U matrix was derived using the input sequence X. If the analysed image was not originally a part of the input sequence, it can no longer be expected to have an optimal principal component representation. However, principal component analysis can still be used for classification of unknown images which are relatively similar to the images from the input sequence. The difference between the unknown image and all the images used to generate the principal component space can be described by a specific distance measure. The ability to analyse unknown images using a fixed set of training images is utilized in face recognition algorithms [23].
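To make Eqns (1)-(9) concrete, the following sketch (in Python with NumPy; it is our illustration, not code from the paper, and names such as build_pc_space are our own) constructs the principal component space for an initial set of sub-images using the small M x M matrix C = Y^T Y instead of the full N x N covariance matrix, as described above.

    import numpy as np

    def build_pc_space(images):
        """Build a principal component space from the M sub-images of an initial set.

        Returns the mean vector m_x (Eqn (3)), the matrix U of N-dimensional
        eigenvectors, and the projections A of the initial set (Eqn (8)).
        """
        # Scan each R x C image into an N = C*R column vector and stack them as X.
        X = np.column_stack([img.astype(float).ravel() for img in images])  # N x M

        # Mean image and the centred matrix Y (Eqns (2) and (3)).
        m_x = X.mean(axis=1, keepdims=True)
        Y = X - m_x                                                          # N x M

        # Instead of the N x N covariance S = Y Y^T (Eqn (1)), eigendecompose
        # the small M x M matrix C = Y^T Y -- the shortcut described in the text.
        C = Y.T @ Y
        eigvals, V = np.linalg.eigh(C)                 # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]              # largest variance first
        V = V[:, order]

        # If C v = lambda v, then S (Y v) = lambda (Y v), so the N-dimensional
        # eigenvectors are Y v, normalized here to unit length.
        U = Y @ V
        U = U / np.maximum(np.linalg.norm(U, axis=0, keepdims=True), 1e-12)

        # Projections of the initial set onto the component space (Eqn (8)).
        A = U.T @ Y                                                          # M x M
        return m_x, U, A

With 16 initial sub-images of 50 x 50 pixels, X is 2500 x 16 but C is only 16 x 16, which is what makes a software-only real-time implementation plausible.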

Automatic Tracking Algorithm

The successive frames of a head-and-shoulders video sequence are relatively similar to each other. All frames of such a sequence have one feature in common: they contain the face of the speaker. While classification of the image as a face does not present any problem for the human brain, it is extremely difficult for a machine to perform the same task reliably. However, an image of a face contains certain distinct features (the left eye, the right eye, the nose and the lips) that can be identified automatically. These facial features will be referred to as important facial features. Each facial feature should be tracked separately so that its 2D co-ordinates can be used to determine the current position of the speaker's head. The input matrix X is formed from the sub-images containing important facial features (Figure 2, left) extracted from the M initial frames of the analysed sequence (using the method described in [18]). The matrix X is formed separately for each facial feature. In this case there are four X matrices: for the left eye, the right eye, the nose and the lips. If, for example, the number of initial images was 16 and the dimensions of the extracted sub-images were 50 x 50, then M and N in Eqns (2)-(4) would be 16 and 2500, respectively. The automatic tracking commences with frame M+1. The initial position of the tracked facial feature in frame M+1 (current frame) is assumed to be the same as in frame M (previous frame). This view is subsequently


Figure 2. Tracking system.

verified in the following way. Sub-images within the search range are extracted for the current frame (e.g. a 15 x 15 search range would result in 225 sub-images). In order to avoid confusion these images are referred to as the extracted set (Figure 2). The term initial set will be used to describe the sub-images extracted from the M initial frames (matrix X). The dimensions of the images from the extracted set are identical to those from the initial set. Since the extracted set images are similar to those from the initial set, it can be assumed that they can be projected onto the principal component space created by input vector X. The further analysis of the extracted set is performed in R-dimensional (R <= M) principal component space. The goal is to find, among the images in the extracted set, the image that is most likely to contain the required feature. This image is referred to as the best match image. Its 2D co-ordinates will mark the position of the tracked facial feature in the current (M+1) frame.
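As an illustration of this step (our sketch, not the paper's code), the extracted set for one feature might be gathered as follows; the 50 x 50 window and 15 x 15 search range are the example sizes quoted in the text.

    import numpy as np

    def extract_candidates(frame, centre, win=(50, 50), search=(15, 15)):
        """Return the 'extracted set': one sub-image per candidate position
        inside the search range around the feature's position in the previous frame."""
        r0, c0 = centre
        hr, hc = win[0] // 2, win[1] // 2
        candidates = []
        for dr in range(-(search[0] // 2), search[0] // 2 + 1):
            for dc in range(-(search[1] // 2), search[1] // 2 + 1):
                r, c = r0 + dr, c0 + dc
                sub = frame[r - hr:r + hr, c - hc:c + hc]
                if sub.shape == (2 * hr, 2 * hc):      # ignore windows falling off the frame
                    candidates.append(((r, c), sub.astype(float)))
        return candidates

A 15 x 15 search range yields up to 225 candidate sub-images per feature per frame, as noted above.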

The most straightforward method of establishing the difference between two images is calculation of the Euclidean distance separating them in the principal component space created by vector X:

e_ij = || a_i - b_j ||                                (10)

where a_i is the projection of the ith image vector from the initial set onto the K-dimensional principal component space according to relationship (7), and b_j is the projection of the jth image from the extracted set onto the same space. The distance e_ij is calculated for every possible pair of images from the initial set and the extracted set. The jth image from the extracted set for which the e_ij distance is minimal is the best match image. Its 2D co-ordinates mark the position of the particular facial feature in the next frame.

It is worth noting that the dimensionality of the principal component space can be reduced to less than M, as correct recognition of the most similar sub-image is required, rather than its reconstruction. Analysis of the variance of a particular principal component (i.e. the absolute value of the characteristic root associated with it) can lead to its elimination from further processing. This approach further reduces the execution time of the algorithm.

A different method of derivation of the best match image has also been tested. The best result found using relationship (10) points to the image which is most similar to one particular image from the initial set. However, it seems that finding the image that is most similar to all the images from the initial set might provide a better measure of distance. A suitable distance measure can be obtained using the following equation:

d_i = || y_i - sum_{i=1}^{M} z_i u_i ||               (11)

where y_i denotes the normalized ith image from the extracted set, u_i the ith eigenvector of the covariance matrix S, and z_i the ith principal component. The more similar the unknown image is to the initial set, the smaller will be the distance d_i. Large differences between the input and output images will result in a large value of d_i. Similarly to the previous case, the ith image from the extracted set for which the d_i distance is minimal is the best match image, and its 2D co-ordinates mark the position of the particular facial feature in the next frame.
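A sketch of the two distance measures (again our own illustration, assuming m_x, U and the initial-set projections A come from a routine like build_pc_space above and the candidate list from extract_candidates; the parameter K stands for an optionally reduced number of components).

    import numpy as np

    def best_match_method_a(candidates, m_x, U, A, K=None):
        """Method A (Eqn (10)): pick the candidate whose projection lies closest,
        in Euclidean terms, to any single image of the initial set."""
        Uk = U[:, :K] if K else U          # optionally drop low-variance components
        Ak = A[:K, :] if K else A
        best_pos, best_dist = None, np.inf
        for pos, sub in candidates:
            b = Uk.T @ (sub.ravel()[:, None] - m_x)    # projection of the candidate
            d = np.linalg.norm(Ak - b, axis=0).min()   # distance to nearest initial image
            if d < best_dist:
                best_pos, best_dist = pos, d
        return best_pos

    def best_match_method_b(candidates, m_x, U):
        """Method B (Eqn (11)): pick the candidate with the smallest reconstruction
        residual, i.e. the one closest to the initial set as a whole."""
        best_pos, best_dist = None, np.inf
        for pos, sub in candidates:
            y = sub.ravel()[:, None] - m_x             # normalized candidate image
            z = U.T @ y                                # its principal components
            d = np.linalg.norm(y - U @ z)              # residual outside the space
            if d < best_dist:
                best_pos, best_dist = pos, d
        return best_pos

Method B measures similarity to the initial set as a whole, which matches the explanation given later for its smoother behaviour.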

Initially, efforts were focused on tracking the geometrical centres of important facial features. In particular terms this means tracking the centre of the corona of the eye, the mid-distance between the nostrils and the centre of the lips. The co-ordinates of the important facial features tracked throughout the sequence form a rigid triangle (e.g. the left eye, the right eye and the nose, or the left eye, the right eye and the lips). It is therefore possible to deduce global motion (motion of the speaker's head) from the motion of this triangle.
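The paper does not give the formula used to deduce global motion from this triangle; one way it could be done (purely our illustration) is a least-squares Procrustes fit of a 2D similarity transform between the triangle in the previous frame and in the current frame.

    import numpy as np

    def similarity_from_triangle(prev_pts, curr_pts):
        """Estimate scale s, rotation R and translation t such that
        curr ~ s * R @ prev + t, from two corresponding (3, 2) point sets
        (e.g. the left eye, the right eye and the nose)."""
        p = prev_pts - prev_pts.mean(axis=0)
        q = curr_pts - curr_pts.mean(axis=0)
        # Optimal rotation from the SVD of the 2 x 2 cross-covariance matrix.
        Uc, _, Vt = np.linalg.svd(p.T @ q)
        R = (Uc @ Vt).T
        if np.linalg.det(R) < 0:                       # guard against a reflection
            Vt[-1, :] *= -1
            R = (Uc @ Vt).T
        s = np.trace(R @ p.T @ q) / np.trace(p.T @ p)  # isotropic scale (zoom)
        t = curr_pts.mean(axis=0) - s * prev_pts.mean(axis=0) @ R.T
        return s, R, t

In such a reading, the scale would correspond to zoom, the rotation angle to in-plane head rotation and the translation to head pan; out-of-plane rotation would require the 3D wire-frame itself.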

Experimental Results

The tracking algorithm has been tested on head-and-shoulders video sequences with varying amounts of motion, frame size and sequence length. Experiments were carried out in three stages. In the first (preliminary) stage, a reliable distance measure was established using one of the video sequences. In the second (main) stage, the overall performance of the tracking algorithm was tested on all the video sequences. In the final (concluding) stage, the suitability of the algorithm for tracking the shape of particular facial features was examined. In an attempt to determine the best measure of distance, the algorithm was tested using Eqn (10) as method A and Eqn (11) as method B. The Miss America (CIF-sized) video sequence was used for this purpose. In both experiments tracking was maintained throughout the entire sequence for all facial features, as shown by subjective tests using a created movie with white crosses centred on the tracked points of the speaker's face. The centres of the crosses were constantly kept within the corona of the eye (when visible), between the nostrils, and in the centre of the lips. Tracking was maintained even when the speaker closed her eyes (quite frequently


in the Miss America sequence) or opened and closed her lips. However, subjective observation showed the results obtained using method B to be more acceptable: the crosses centred on facial features moved smoothly and the response to head pan was immediate. In order to assess the accuracy of the tracking method more precisely, the 2D locations of the important facial features were extracted manually from all 150 frames of the Miss America sequence. The error was measured as the Euclidean distance between the 2D co-ordinates of the features tracked automatically (in methods A and B) and manually. Tracking error profiles are presented in Figures 3-6. As can be seen from Table 1, the application of method B gives superior results. The mean error in method B is less than one pixel for all the facial features. This result can be explained in the following manner. Since Eqn (11) describes the distance of the unknown image from the entire initial set, a random recognition error (which could take place in both method A and method B) should be averaged over all the images from the initial set in method B. However, the result in method A is based on the distance between two distinct images, one from the initial set and one from the extracted set, which is likely to give a greater average error. Because of its superior performance, method B was employed for all subsequent work. The head-and-shoulders video sequences used for further testing are listed in Table 2; all of these are readily available from well-known WWW sites. These sequences reflect a wide variety of possible head-and-shoulders videophone situations containing moderate speaker head pan, rotation and zoom. CIF-sized sequences (Miss America, Claire, Salesman, Trevor) as well as QCIF sequences (Car Phone, Grandma) are included. The subjects of two

Figure 3. Method A (left) and Method B (right) left eye tracking error profiles.


Figure 4. Method A (left) and Method B (right) right eye tracking error profiles.

Figure 5. Method A (left) and Method B (right) nose tracking error profiles.

Figure 6. Method A (left) and Method B (right) lips tracking error profiles.

video sequences (Trevor, Grandma) wear glasses, and the Car Phone sequence was shot in a moving car. The algorithm produced reliable and consistent results for all tested sequences. In all tests, tracking of all important facial features was maintained throughout the entire sequence. In order to obtain a measure of the accuracy of the algorithm, a method similar to the one described above was used. However, because of the length and number of test sequences, only the 2D positions of the important facial features from every fifth frame of each sequence were extracted. This

resulted in 20 frames in the case of Trevor (lower limit) and 154 frames in the case of Grandma. Stills which illustrate the range of movement of the facial features in the test sequences are presented in Figures 7-12. Also, the error profiles for the sequence with the lowest (Grandma) and the highest (Car Phone) amount of motion are provided


(Figures 13-16). The resultant mean error and standard deviation are given in Tables 3-5. As can be seen, the mean error for all the facial features in almost all the sequences was less than 1 pixel. The left and the right eye seem to be the most reliably tracked facial features, possibly due to a combination of very light (white of the

Table 1. Mean tracking error and standard deviation: Miss America

                  Method A (150 frames)                       Method B (150 frames)
Facial feature    Mean error [pixels]  Std deviation [pixels]  Mean error [pixels]  Std deviation [pixels]
Left eye          0.707                0.606                   0.633                0.611
Right eye         1.324                1.232                   0.944                0.749
Nose              1.230                0.828                   0.548                0.606
Lips              1.833                1.367                   0.712                0.623

Table 2. Sequences tested in the main stage

Sequence        Horizontal size [pixels]  Vertical size [pixels]  Length [frames]  Sample [frames]
Miss America    352                       240                     150              30
Claire          360                       288                     168              34
Car Phone       176                       144                     400              80
Grandma         176                       144                     768              154
Salesman        360                       288                     400              80
Trevor          256                       256                     100              20

Figure 7. Tracking Miss America frames: 0, 20, 85, 98, 110, 120.


Figure 8. Tracking Claire frames: 0, 75, 95, 105, 140, 155.

eye) and very dark (corona of the eye) pixels. However, the track was also maintained during periods when the speakers closed their eyes (Miss America, Car Phone). There were no problems with tracking eyes occluded by glasses (Trevor, Grandma). Also, the algorithm coped well with temporary occlusions (Salesman). Especially in the case of QCIF sequences (Car Phone, Grandma), the reference manual fitting can also introduce additional

Figure 9. Tracking Car Phone frames: 80, 105, 170, 200, 290, 320.

error, since it is sometimes difficult to judge the position of the middle of the particular facial feature. However, the overall results clearly demonstrate that the algorithm is able to maintain tracking of important facial features for a variety of head-and-shoulders sequences. The full sequences illustrating the performance of the algorithm are available from the following Internet site: http://www.ee.ed.ac.uk/plma/.


Figure 10. Tracking Grandma frames: 120, 180, 280, 340, 410, 440.

In the final part of the experiment the effectiveness of the algorithm for tracking the shape of individual facial features was investigated. Since the Candide wire-frame is utilized to reconstruct local motion (e.g. lips opening and closing, eyes opening and closing), the motion of the vertices assigned to the selected facial features must be tracked reliably. In the case of the eyes, the eyebrows and the lips this involves tracking

Figure 11. Tracking Salesman frames: 0, 120, 140, 210, 340, 351.

of at least four vertices (Figure 17). The same algorithm was utilized, but this time the initial set images were centered on the points of the image that corresponded to the positions of the wire-frame vertices of a particular facial feature (Figure 18, left; Figure 2, right). Even if the shape of the facial features changes radically, the contents of the image from the initial set will change only slightly (Figures 19 and 20).
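A small sketch (ours, not the paper's) of how the per-vertex initial sets could be assembled; the 20 x 20 window size is an assumption chosen only to illustrate that these windows are smaller than the 50 x 50 feature windows, which is the source of the computational saving mentioned later.

    def vertex_initial_sets(frames, vertex_positions, win=(20, 20)):
        """Build one small initial set per wire-frame vertex.

        frames           : the M initial frames as 2D NumPy arrays.
        vertex_positions : dict mapping a vertex id to its (row, col) position
                           in each of the M frames, e.g. {'upper_lip': [(r, c), ...]}.
        Returns {vertex id: list of M sub-images}, ready for build_pc_space().
        """
        hr, hc = win[0] // 2, win[1] // 2
        sets = {}
        for vid, positions in vertex_positions.items():
            sets[vid] = [frame[r - hr:r + hr, c - hc:c + hc].astype(float)
                         for frame, (r, c) in zip(frames, positions)]
        return sets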


Figure 12. Tracking Trevor frames: 0, 20, 30, 40, 50, 70.

Figure 13. Car Phone: the left eye (left) and right eye (right) tracking error profiles.

Figure 14. Car Phone: the nose (left) and the lips (right) tracking error profiles.


Figure 15. Grandma: the left eye (left) and the right eye (right) tracking error profiles.

Figure 16. Grandma: the nose (left) and the lips (right) tracking error profiles.

Table 3. Tracking error results for Miss America and Claire

                  Miss America (30 of 150 frames)               Claire (34 of 168 frames)
Facial feature    Mean error [pixels]  Std deviation [pixels]   Mean error [pixels]  Std deviation [pixels]
Left eye          0.624                0.678                    0.407                0.533
Right eye         0.844                0.686                    0.638                0.741
Nose              0.569                0.596                    0.784                0.553
Lips              0.489                0.582                    1.018                0.621

Table 4. Tracking error results for Car Phone and Grandma

                  Car Phone (80 of 400 frames)                  Grandma (154 of 768 frames)
Facial feature    Mean error [pixels]  Std deviation [pixels]   Mean error [pixels]  Std deviation [pixels]
Left eye          0.878                0.715                    0.856                0.600
Right eye         0.886                0.781                    0.829                0.648
Nose              1.0594               0.873                    0.372                0.528
Lips              1.663                1.042                    0.731                0.670


Table 5. Tracking error results for Salesman and Trevor

                  Salesman (80 of 400 frames)                   Trevor (20 of 100 frames)
Facial feature    Mean error [pixels]  Std deviation [pixels]   Mean error [pixels]  Std deviation [pixels]
Left eye          0.862                0.784                    0.844                0.752
Right eye         0.827                0.963                    0.839                0.886
Nose              0.845                0.692                    0.853                0.583
Lips              0.939                1.314                    0.695                0.701

Figure 17. Tracking points for the left eye and the left eyebrow (left) and the lips (right) of the Candide model.

Figure 18. Tracking anchor vertices (left) used to manipulate the Candide model (right).

This assures continuity of tracking. Another advantage of the decrease in size of the images from the initial set is a considerable reduction of the computational load required to track a particular vertex. Using this approach, it was possible to track the closing and opening of the lips and eyes. Again, observation of test video sequences recreated from the results of the algorithm showed good tracking performance. The tracked vertices were subsequently used as anchors for vertices of the Candide wire-frame model (Figure 18, right), and the wire-frame model was driven by both the global motion of the speaker's head and the local motion of the facial features. Subjective observation showed this to operate very effectively.

Figure 19. Corners of the eyes and lips initial set extraction regions (Miss America, frame 0).

Figure 20. Miss America initial sets for the left and right eye corners (upper row) and corresponding eigenvector images (bottom row).

Conclusions and Further Research

A new and reliable algorithm for automatically tracking the motion of facial features in head-and-shoulders scenes, based on eigenvalue decomposition of sub-images containing facial features such as the eyes, the nose and the lips, has been developed. The algorithm was tested on a range of sequences containing limited pan, rotation and zoom of the speaker's head. Tracking of all important facial features was maintained in all cases. Both global and local motion were recovered and successfully used to drive the Candide wire-frame. The use of SVD allows tracking at speeds higher than 10 frames/s on an entry-level Pentium processor-based computer (120 MHz, 16 MB RAM).

In future research we intend to reconstruct the tracked video sequence using texture mapping and to apply the techniques described here to implement a software-based videophone system, utilizing both signal processing and machine vision techniques. We also envisage application of the described algorithm in virtual reality systems.

Acknowledgement

The authors wish to thank Professor Don Pearson of The University of Essex for valuable comments. Paul M. Antoszczyszyn acknowledges the support of The University of Edinburgh through a Postgraduate Studentship.

References

1. ITU-T Draft H.263. (1995) Line transmission of non-telephone signals. Video coding for low bitrate communication.
2. Li, H., Lundmark, A. & Forchheimer, R. (1994) Image sequence coding at very low bitrates: a review. IEEE Transactions on Image Processing, 3: 589-609.
3. Gersho, A. (1982) On the structure of vector quantizers. IEEE Transactions on Information Theory, 28: 157-166.
4. Jacquin, A.E. (1992) Image coding based on a fractal theory of iterated contractive image transformations. IEEE Transactions on Image Processing, 1: 18-30.
5. Antonini, M., Barlaud, M., Mathieu, P. & Daubechies, I. (1992) Image coding using wavelet transform. IEEE Transactions on Image Processing, 1: 205-220.
6. Musmann, H.G. (1995) A layered coding system for very low bit rate video coding. Signal Processing: Image Communication, 7: 267-278.
7. Musmann, H.G., Hötter, M. & Ostermann, J. (1989) Object-oriented analysis-synthesis coding of moving images. Signal Processing: Image Communication, 1: 117-138.
8. Aizawa, K., Harashima, H. & Saito, T. (1989) Model-based analysis synthesis image coding (MBASIC) system for a person's face. Signal Processing: Image Communication, 1: 139-152.
9. Forchheimer, R. & Kronander, T. (1989) Image coding - from waveforms to animation. IEEE Transactions on Acoustics, Speech and Signal Processing, 37: 2008-2023.


10. Welsh, B. (1991) Model-based coding of video images. Electronics and Comms. Eng. Journal, 3: 29-38.
11. Reinders, M.J.T., Van Beek, P.J.L., Sankur, B. & Van der Lubbe, J.C.A. (1995) Facial feature localization and adaptation of a generic face model for model-based coding. Signal Processing: Image Communication, 7: 57-74.
12. Seferidis, V. (1991) Facial feature estimation for model-based coding. Electronics Letters, 27: 2226-2228.
13. Antoszczyszyn, P.M., Hannah, J.M. & Grant, P.M. (1996) Automatic frame fitting for semantic-based moving image coding using a facial code-book. In: Proceedings of VIII European Signal Processing Conference, EUSIPCO '96, II: 1369-1372.
14. Antoszczyszyn, P.M., Hannah, J.M. & Grant, P.M. (1996) Accurate automatic frame fitting for semantic-based moving image coding. In: Proceedings of 1996 IEEE International Conference on Image Processing, ICIP '96, I: 689-692.
15. Li, H. & Forchheimer, R. (1994) Two-view facial movement estimation. IEEE Transactions on Circuits and Systems for Video Technology, 4: 276-287.
16. Kokuer, M. & Clark, A.F. (1992) Feature and model tracking for model-based coding. In: Proceedings of 1992 IEE International Conference on Image Processing and Its Applications, pp. 135-138.

17. Antoszczyszyn, P.M., Hannah, J.M. & Grant, P.M. (1997) Facial features motion analysis for wire-frame tracking in model-based moving image coding. In: Proceedings of 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '97, IV: 2669-2672.
18. Antoszczyszyn, P.M., Hannah, J.M. & Grant, P.M. (1997) Facial features model fitting in semantic-based scene analysis. Electronics Letters, 33: 855-857.
19. Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24: 417-441 and 498-520.
20. Karhunen, K. (1946) Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Ser. A1, Math. Phys., 37.
21. Loève, M. (1948) Fonctions aléatoires de second ordre. In: Lévy, ed. Processus stochastiques et mouvement Brownien. Paris, France: Hermann.
22. Murakami, H. & Kumar, V. (1982) Efficient calculation of primary images from a set of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 4: 511-515.
23. Turk, M. & Pentland, A. (1991) Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3: 71-86.