
ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 112

Second-Generation Image Coding

N. D. BLACK,¹ R. J. MILLAR,² M. KUNT,³ M. REID,⁴ and F. ZILIANI³

¹Information & Software Engineering, University of Ulster, Northern Ireland
²Computing & Mathematical Sciences, University of Ulster, Northern Ireland
³Swiss Federal Institute of Technology, Lausanne, Switzerland
⁴Kainos Software Ltd, Belfast, Northern Ireland

I. Introduction
II. Introduction to the Human Visual System
III. Transform-Based Coding
   A. Overview
   B. The Optimum Transform Coder
   C. Discrete Cosine Transform Coder
   D. Multiscale/Pyramidal Approaches
   E. Wavelet-Based Approach
   F. Edge Detection
   G. Directional Filtering
IV. Segmentation-Based Approaches
   A. Overview
   B. Preprocessing
   C. Segmentation Techniques: Brief Overview
   D. Texture Coding
   E. Contours Coding
   F. Region-Growing Techniques
   G. Split-and-Merge-Based Techniques
   H. Tree/Graph-Based Techniques
   I. Fractal-Based Techniques
V. Summary and Conclusions
References

I. INTRODUCTION

The thirst for digital signal compression has grown over the last few decades, largely as a result of the demand for consumer products, such as digital TV, commercial tools, such as visual inspection systems and video conferencing, as well as for medical applications. As a result a number of "standards" have emerged that are in widespread use today, and which exploit some aspect of the particular image they are used on to achieve reasonable compression rates. One such standard is M-JPEG, which was originally developed for the compression of video images. It does this by treating each image as a separate still picture. It works by taking "blocks" of picture elements and processing them using a mathematical technique known as the discrete cosine transform (DCT), resulting in a set of digital data representing particular aspects of the original image. These data are then subject to "lossless" compression to further reduce the size before transmission. The technique is very effective but, as one might expect, affects the resulting image quality to a certain degree. At high data rates, for example, the process has the effect of enhancing picture contrast, whereas at low data rates the process introduces "blocking" effects, which deteriorate the picture quality.

Successive compression techniques often build upon previous designs, as is the case with the MPEG standard, which encodes images rather like the M-JPEG standard but transmits information on the differences between successive image frames. In this way improved compression ratios can sometimes be achieved. The gain in compression is often at the expense of some other feature such as quality. The MPEG standard, for example, offers higher compression than M-JPEG but produces a recovered picture that is not only less sharp but introduces significant delays. The International Telecommunications Union (ITU) has defined a number of standards relating to digital video compression, all of which use the H.261 compression standard. This technique is specifically designed for low bandwidth channels and, as a result, does not produce images which could be considered of TV quality. Currently, the best compression techniques can produce about a 20:1 compression if the picture quality is not to be compromised.

These "standards" are all based upon the so-called "First-Generation" coding techniques. All exploit temporal correlation through block-based motion estimation and compensation techniques, whereas they apply frequency transformation techniques (mainly the discrete cosine transform, DCT) to reduce spatial redundancy. There is a high degree of sophistication in these techniques and a number of optimization procedures have been introduced that further improve their performances. However, the limits of these approaches have been reached and further optimizations are unlikely to result in drastically improved performance.

First-generation coding schemes are based on classical information theory (Huffman, 1952; Golomb, 1966; Welch, 1977) and are designed to reduce the statistical redundancies present in the image data. These schemes exploit spatial and temporal redundancies in the video sequence at a pixel level or at the level of fixed-size blocks of pixels. The various different schemes attempt to achieve the least possible coding rate for a given image distortion, and/or to minimize the distortion for a given bit rate. The compression ratios obtained with first-generation lossless techniques are moderate at around


2:1. With lossy techniques a higher ratio (greater than 30:1) can be achieved, but at the expense of image quality. The distortion introduced by the coding scheme is generally measured in terms of mean square error (MSE) between the original image and its reconstructed version. Although MSE is a simple measure of distortion that is easy to compute, it is limited in characterizing the perceptual level of degradation of an image. New image quality measures are necessary for second-generation image coding techniques. These will be introduced and discussed later in this paper.

Second-generation image coding was first introduced by Kunt et al. (1985). The work stimulated new research aimed at further improvement in the compression ratios compared with those that were produced using existing coding strategies whose performances have now reached saturation level. The main limitation of first-generation schemes compared with the second-generation approach is that first-generation schemes do not directly take into account characteristics of the human visual system (HVS) and hence the way in which images are perceived by humans. In particular, first-generation coding schemes ignore the semantic content of the image, simply partitioning each frame into artificial blocks. It is this that is responsible for the generation of strong visible degradation referred to as blocking artifacts, because a block can cover spatially/temporally nonhomogeneous data belonging to different entities (objects) in the scene. Block partitioning results in a reduced exploitation of spatial and temporal redundancies. In contrast, instead of limiting the image coding to a rigid block-based grid, second-generation approaches attempt to break down the original image into visually meaningful subcomponents. These subcomponents may be defined as image features and include edges, contours, and textures. These are known to represent the most relevant information to enable the HVS to interpret the scene content (Cornsweet, 1970; Jain, 1989; Rosenfeld and Kak, 1982) and need to be preserved as much as possible to guarantee good perceptual quality of the compressed images. Second-generation coding techniques minimize the loss in terms of human perception so that, when decoded, the reconstructed image does not appear to be different from the original. For second-generation schemes, therefore, MSE is not sufficient as a measure of quality and a new criterion is required to correctly estimate the distortion introduced.

As an alternative to edges, contours, and textures, the scene may be represented as a set of homogeneous regions or objects. This representation offers several advantages. First, each object is likely to present a high spatial and temporal correlation, improving the efficiency of the compression schemes well beyond the limits imposed by a block-based representation. Second, a description of the scene in terms of objects gives access to a variety of


functions. For example, it is possible to assign a priority to each object and to distribute, accordingly, the available bit rate for the single frame. This functionality, referred to as scalability, enhances the quality of the objects of interest compared to those regions with less importance in the scene. Similarly, it is possible to apply to each object, according to its properties, the corresponding optimum coding strategy. This concept, referred to as dynamic coding (Ebrahimi et al., 1995), may further optimize the overall performance of the coding scheme. As suggested in Kunt (1998), future multimedia systems will strongly exploit all of these new functions; indeed, some have already been introduced in the new video coding standard MPEG-4 (Ebrahimi, 1997).

This chapter is organized into four main sections. In Section II a brief introduction to the human visual system is given in which characteristics that may be exploited in compression systems are discussed. Throughout the text, and where it is appropriate, additional and more specific material is referenced. Sections III and IV present the main body of the chapter and include transform-based techniques and segmentation-based techniques, respectively. Finally, we offer a Summary and Conclusions in Section V.

II. INTRODUCTION TO THE HUMAN VISUAL SYSTEM

We consider essentially two techniques for the coding of image information: transformation and segmentation. These techniques are essentially signal processing strategies, many of which can be designed to exploit aspects of the human visual system (HVS) in order to gain coding efficiency. Part of the process of imaging involves extracting certain elements or features from an image and presenting them to an observer in a way that matches their perceptual abilities or characteristics. In the case of a human observer, there are a number of sensitivities, such as amplitude, spatial frequency, and image content, that can be exploited as a means of improving the efficiency of image compression. In this section, we shall introduce the reader to the HVS by identifying some of its basic features, which can be exploited fruitfully in coding strategies. We shall consider quantitatively those aspects of the HVS that are generally important in the imaging process. More explicit information on the HVS, as it relates to specific algorithms described in the text, is referenced throughout the text. The HVS is part of the nervous system and, as such, presents a complex array of highly specialized organs and biological structures. Like all other biological organs, the eye, which forms the input to the HVS, consists of highly specialized cells that react with specified yet limited functionality to


input stimuli in the form of light. The quantitative measure of light power is luminance, measured in units of candela per meter squared (cd/m²). The luminance of a sheet of white paper reflecting in bright sunlight is about 30,000 cd/m² and that in dim moonlight is around 0.03 cd/m²; a comfortable reading level is a page that radiates about 30 cd/m². As can be seen from these examples, the dynamic range of the HVS, defined as the range between luminance values so low as to make the image just discernible and those at which an increase in luminance makes no difference to the perception, is very large and is, in fact, in the order of 80 dB. From a physical perspective, light enters the eye through the pupil, which varies in diameter between 2 and 9 mm, and is focused onto the imaging retina by the lens. Imperfections in the lens can be modeled by a two-dimensional (2D) lowpass filter while the pupil can be modeled as a lowpass filter whose cut-off frequency decreases with enlargement. The retina contains the neurosensory cells that transform incoming light into neural impulses, which are then transmitted to the brain, where image perception occurs. It has two types of cells, cones and the slightly more sensitive rods. Both are responsible for converting the incoming light into electrical signals while compressing its dynamic range. The compression is made according to a nonlinear power law of the form B = aL^γ, where B represents brightness and L represents luminance. Both sensitivity and resolution characteristics of the HVS are largely determined by the retina. Its center is an area known as the fovea, which consists mainly of cones separated by distances large enough to facilitate grating resolutions of up to 50 cycles/degree of subtended angle. The spatial contrast sensitivity of the retina depends on spatial frequency and varies according to luminance levels. Figure 1, derived from Pearson, shows the results of this sensitivity at two luminance levels, 0.05 cd/m² and 500 cd/m². The existence of the peaks in Fig. 1 illustrates the ability of the HVS to identify sharp boundaries and, further, its limitations in identifying gradually changing boundaries. The practical effect of Figure 1 is that the HVS is very adept at identifying distinct changes in boundaries, grayscale, or color, but important detail can easily be missed if the changes are more gradual. The rods and cones are interconnected in complex arrangements and this leads to a number of perceptual characteristics, such as lateral inhibition. Cells that are activated as a result of stimulation can be inhibited from firing by other activated cells that are in close proximity. The effect of this is to produce an essentially high-pass response, which is limited to below a radial spatial frequency of approximately 10 cycles/degree of solid angle, beyond which integration takes place. The combined result of lateral inhibition and the previously described processes makes this part of the HVS behave as a linear model with a bandpass frequency response.


FIGURE 1. Typical contrast sensitivity of the eye for sine-wave gratings (contrast sensitivity plotted against spatial frequency, in cycles/degree, at two luminance levels). Evidently the perception of fine detail is dependent on luminance level; this has practical implications, for example on the choice between positive and negative modulation for the display of particular types of image.

The majority of the early work on vision research used the frequency sensitivity of the HVS as described by the modulation transfer function (MTF). This characterizes the degree to which the system or process can image information at any spatial frequency. The MTF is defined for any stage of an imaging system as the ratio of the amplitudes of the imaged and the original set of spatial sine-waves representing the object which is being imaged, plotted as a function of sine-wave frequency. Experiments by Mannos and Sakrison (1974) and Cornsweet (1970) led to a now commonly used model for this function, which relates the sensitivity of the eye to sine-wave gratings at various frequencies. Several authors have made use of these properties (Jang and Rajala, 1990, 1991; Civanlar et al., 1986) in preprocessing strategies, particularly when employing segmentation, where images are preprocessed to take account of the HVS's greater sensitivity to gradient changes in intensity and threshold boundaries (Civanlar et al., 1986; Marqués et al., 1991).
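To make the shape of this model concrete, the following sketch (not from the original chapter) evaluates a closed form of the Mannos and Sakrison contrast sensitivity curve as it is commonly quoted in the literature; the numerical constants are an assumption and should be checked against the original reference before use.

```python
import math

def csf_mannos_sakrison(f_cycles_per_degree: float) -> float:
    """Commonly quoted form of the Mannos-Sakrison contrast sensitivity
    function; it peaks at a few cycles/degree and falls off at low and
    high spatial frequencies (constants as usually reported, not
    verified against the original 1974 paper)."""
    f = f_cycles_per_degree
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

if __name__ == "__main__":
    for f in (0.5, 1, 2, 4, 8, 16, 32):
        print(f"{f:5.1f} cy/deg -> relative sensitivity {csf_mannos_sakrison(f):.3f}")
```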


FIGURE 2. Dependence of the threshold contrast ΔL/L (the Weber ratio) on the size (subtended angle, in mrad) of an observed circular disc object, for two levels of background luminance, with zero noise and a 6 s viewing time. The effect of added noise and/or shorter viewing time will generally be to increase threshold contrast relative to the levels indicated here (after Blackwell (1996)).

A considerable amount of research work has been carried out to determine the eye's capability in contrast resolution (the interested reader is referred to Haber and Hershenson, 1973, for detailed information). The contrast resolution threshold is given as ΔL/L, where L is the luminance level of a given image and ΔL is the difference in luminance level that is just noticeable to an observer. The ratio, known as the Weber ratio, is a function of the light falling on the retina and can vary considerably. Under ideal conditions, plots of Weber's ratio indicate that the eye is remarkably efficient at contrast resolution between small differences in grayscale level. The response of the HVS is a function of both spatial and temporal frequency and is shown in Fig. 2. Measurements by Blackwell (1996) on the joint spatiotemporal sensitivity indicate that this joint sensitivity is not


separable. As shown in Figure 2, distinct peaks in individual bandpass characteristics appear in both cases. In the following sections of this paper, algorithms derived to facilitate the coding of images will exploit aspects of the HVS with the intention of improving coding efficiency. For lossless transformations the efficiency does not always result in significant compression but the tasks in the coding sequence such as quantization and ordering are made easier if the properties of the HVS can be properly exploited.

III. TRANSFORM-BASED CODING

A. Overview

This section introduces and describes some of the most frequently used coding techniques, based on image transformation. Basically it consists of two successive steps: image decomposition/transformation; and quantization/ordering of the transform coefficients. The general structure of all of these techniques is summarized in Figure 3. The basic idea exploited by these techniques is to find a more compact representation of the image content. This is initially achieved by applying a decomposition/transformation step; different decompositions/transformations can be applied. Most transformation techniques (discrete cosine transform, pyramidal decompo-

FIGURE 3. A generic transform coding scheme. The image is first transformed according to the chosen decomposition/transformation function. Then a quantization step, possibly followed by a reordering, provides a series of significant coefficients that will be converted into a bit-stream after a bit assignment step.


sition, wavelet decomposition, etc.) distinguish low frequency contributions from high frequency contributions. This is a first approximation of what happens in the HVS as was described in Section II. More accurate transformations from an HVS model point of view are applied in the directional-filtering-based techniques (see Section III.G). Here frequency responses in some preferred spatial directions are used to describe the image content. Generally all of these transformations are lossless, thus they do not necessarily achieve a significant compression of the image. However, the resultant transformed image has the property of highlighting the features that are significant in the HVS model, thus easing the task of quantizing and ordering the obtained coefficients according to their visual importance. The real compression step is obtained in the quantization/ordering step and the following entropy-based coding. Here the continuous transform coefficients are first projected into a finite set of symbols each representing a good approximation of the coefficient values. Several quantization methods are possible from the simplest uniform quantization to the more complex vector quantization (VQ). A reordering of nonzero coefficients is generally performed after the quantization step. This is done to better exploit the statistical occurrence of nonzero coefficients to improve the performances of the following entropic coding. In a second generation coding framework, this ordering step is also responsible for deciding which coefficients are really significant and which can be discarded with minimum visual distortion. The criteria used to perform this choice are based on the HVS model properties and they balance the compromise between quality of the final image and compression ratio. The next section will review some of the most popular coding techniques that belong to this class, identifying the properties that make them second generation coding techniques. First a brief introduction on general distortion criteria is presented in Section III.B in order to define the optimum transform coder. Then the discrete cosine transform will be discussed in Section III.C. Multiscale and pyramidal approaches are introduced in Section III.D and wavelet-based approaches are discussed in Section III.E. Finally, techniques that make use of extensive edge information will be reviewed in the last two sections, III.F and III.G.

B. The Optimum Transform Coder

In a transform-based coding scheme, the first step consists of transforming pixel values to exploit redundancies and to improve the compression rates


of successive entropy encoding. Once the optimality criterion is defined, it is possible to find an optimum transform for that particular criterion. In the framework of image coding, the most commonly used optimality criterion is defined in terms of mean square distortion (MSD), also referred to as mean square error (MSE) between the reconstructed and original images. For such a criterion, it has been shown that an optimum transform exists (Schalkoff, 1989; Burt and Adelson, 1983) in the Karhunen-Loève (KL) transform (Karhunen, 1947; Loève, 1948). The KL transform depends on critical factors such as the second-order statistics as well as the size of the image. Due to these dependencies, the basis vectors are not known analytically and their definition requires heavy computation. As a result, the practical use of the KL transform in image coding applications is very limited. Although Jain (1976) proposed a fast algorithm to compute the KL transform, his method is limited to a specific class of image models and thus is not suitable for a general coding system. Fortunately, a good approximation of the KL transform that does not suffer from complexity problems exists as the discrete cosine transform presented in Section III.C. It is important to note that from an HVS point of view, the MSE criterion is not necessarily optimal. Other methods have been considered and are currently under investigation for measuring the visual distortion introduced by a coding system (van den Branden Lambrecht, 1996; Winkler, 1998; Miyahara et al., 1992). They take into account the properties of the HVS in order to define a visual distance between the original and the coded image and thus to assess the image quality of that particular compression (Mannos and Sakrison, 1974). These investigations have already shown improvements in standard coding systems (Westen et al., 1946; Osberger et al., 1946). Future research in this direction may provide more efficient criteria and the corresponding new optimum transforms that could improve the compression ratio without loss in visual image quality.
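As an illustration of why the KL transform is data dependent and computationally heavy, the following sketch (not part of the original chapter) estimates a KL basis for 8 × 8 blocks directly from the sample covariance of an image; NumPy is assumed, and the random "image" is merely a placeholder for real data.

```python
import numpy as np

def kl_basis_from_blocks(image: np.ndarray, n: int = 8) -> np.ndarray:
    """Estimate the Karhunen-Loeve basis of n x n blocks of `image`.
    Returns an (n*n, n*n) matrix whose columns are the eigenvectors of
    the sample covariance, ordered by decreasing eigenvalue."""
    h, w = image.shape
    blocks = [
        image[i:i + n, j:j + n].reshape(-1)
        for i in range(0, h - n + 1, n)
        for j in range(0, w - n + 1, n)
    ]
    x = np.stack(blocks).astype(np.float64)
    x -= x.mean(axis=0)                      # only second-order statistics matter
    cov = np.cov(x, rowvar=False)            # (n*n, n*n) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # strongest components first
    return eigvecs[:, order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(256, 256))            # stand-in for a real image
    basis = kl_basis_from_blocks(img)
    coeffs = basis.T @ img[:8, :8].reshape(-1)   # transform one block
    print(coeffs[:4])
```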

C. Discrete Cosine Transform Coder

The discrete cosine transform (DCT) coder is one of the most widely used in digital image and video coding. Most of the standards available today, from JPEG to the latest MPEG-4, are based on this technique to perform compression. This is due to the good compromise between computational complexity and coding performance that the DCT is able to offer. A general scheme for a coding system based on DCT is presented in Fig. 4. The first step is represented by a Block Partitioning of the image. This is divided into N × N pixel blocks f[x, y], where N needs to be defined.


Typical values for N are 8 or 16. A larger block size may lead to more efficient coding, as the transform may access higher correlated data from the image; however, larger blocks also increase the computational cost of the transform, as will be explained. Better compression efficiency can be achieved by using a combination of blocks of different shapes as suggested by Dinstein et al. (1990). Clearly, this method increases the overall complexity of the coding scheme. In the international standard JPEG, N has been chosen equal to eight; thus the following examples will use the same value. Once the Block Partitioning step is performed each block is coded independently from the others by applying the DCT. The result of this step is a block, F[u, v], of N × N transformed coefficients. Theoretically, the 2D DCT block, F[u, v], of the N × N image block, f[x, y], is defined according to the following formula:

$$F[u, v] = \frac{2}{N}\, C(u)\, C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f[x, y]\, \cos\!\left(\frac{(2x+1)u\pi}{2N}\right) \cos\!\left(\frac{(2y+1)v\pi}{2N}\right)$$

where

$$C(z) = \begin{cases} \dfrac{1}{\sqrt{2}}, & z = 0 \\ 1, & \text{otherwise} \end{cases}$$

for z = u and z = v. Its inverse transform is then given by

$$f[x, y] = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u)\, C(v)\, F[u, v]\, \cos\!\left(\frac{(2x+1)u\pi}{2N}\right) \cos\!\left(\frac{(2y+1)v\pi}{2N}\right)$$

with C(u) and C(v) defined as before. Intuitively, DCT coefficients represent the spatial frequency components of the image block. Each coefficient is a weight that is applied to an appropriate basis function. In Fig. 5 we display the basis functions for an 8 x 8 DCT block. The DCT has some very interesting properties. First, both forward and inverse DCT are separable. Thus, instead of computing the 2D transform, it is possible to apply a one-dimensional (1D) transform along all the rows of the block and then down the columns of the block. This reduces the number of operations to be performed. As an example, a 1D 8-point DCT will require an upper limit of 64 multiplications and 56 additions. A 2D 8 x 8-point DCT considered as a set of 8 rows and 8 columns would require 1024 multiplications and 896 additions. Secondly, it can be observed that the transform kernel is a real function. In coding, this is an interesting property because only the real part for each


FIGURE 5. The basis functions for an 8 × 8 DCT block.

transform coefficient needs to be coded. This is not necessarily true for other transformations. Finally, fast transform techniques (Chen et al., 1977; Narasimha and Peterson, 1978) that take advantage of the symmetries in the DCT equation can further reduce the computational complexity for this technique. For example, the cosine transform for an N x 1 vector can be performed in O(N log 2 N) operations via an N-point FFT. Computational efficiency is not the only important feature of the DCT transform. Of primary importance is its role in performing a good energy compaction of the transform coefficients. The DCT also performs this well; it is verified that in practice the DCT tends towards the optimal KL transform for highly correlated signals such as natural images that can be modeled by a 1st order Markov process (Caglar et al., 1993). Once the DCT has been computed, it is necessary to perform a quantization of the DCT coefficients. At this point of the scheme, no compression has been performed; the quantization procedure will introduce it. Quantization is an important step in coding and, again, it can be performed in several ways. Its main role is to minimize the average distortion introduced by fixing the desired entropy rate (Burt and Adelson, 1993). In practice, it makes more values look the same, so that the subsequent entropy-based coding can improve its performance while coding the DCT coefficients.
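A minimal sketch of the separability property is given below: the 2D block DCT of the text is computed as a 1D transform applied to the rows and then to the columns, using an explicit transform matrix rather than a fast algorithm. The code is illustrative, assumes NumPy, and follows the normalization of the formula above.

```python
import numpy as np

N = 8

def dct_matrix(n: int = N) -> np.ndarray:
    """1D DCT-II matrix matching the block-DCT formula in the text."""
    c = np.ones(n)
    c[0] = 1.0 / np.sqrt(2.0)
    x = np.arange(n)
    # Element [u, x] is sqrt(2/n) * C(u) * cos((2x+1) u pi / (2n)).
    angles = (2 * x[None, :] + 1) * np.arange(n)[:, None] * np.pi / (2 * n)
    return np.sqrt(2.0 / n) * c[:, None] * np.cos(angles)

def dct2(block: np.ndarray) -> np.ndarray:
    """Separable 2D DCT: 1D transform on the rows, then on the columns."""
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T

def idct2(coeffs: np.ndarray) -> np.ndarray:
    d = dct_matrix(coeffs.shape[0])
    return d.T @ coeffs @ d

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f = rng.integers(0, 256, size=(N, N)).astype(float)
    F = dct2(f)
    print(np.allclose(idct2(F), f))   # True: the transform is exactly invertible
```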

TABLE 1
EXAMPLE OF JPEG QUANTIZATION TABLE

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

As discussed in Section III.B, a distortion measure can be defined either in the sense of an MSE or in the more interesting HVS sense. State-of-the-art coding techniques are in general based on the former. In this context, it appears that the uniform quantizer is optimal or quasi-optimum for most of the cases (Burt and Adelson, 1993). A quantizer is said to be uniform when the same distance uniformly separates all the quantization thresholds. The simplicity of this method and its optimal performance are the reasons why the uniform quantizer is so widely used in coding schemes and in standards such as JPEG and MPEG. In particular, in these standards, a quantization table is constructed to define a quantization step for each DCT coefficient. Table 1 shows the quantization table used in the JPEG standard. Each DCT coefficient is divided by the corresponding quantization step value, which dynamically defines its influence. From a perceptual point of view the MSE optimality criterion is not relevant. Therefore, in second-generation coding, alternative techniques have been proposed, such as those described by Macq (1989) and van den Branden (1996). After the quantization is performed, a reordering of the DCT coefficients in a zig-zag scanning order represents the successive step. This procedure starts parsing the coefficients from the upper-left position, which represents the DC coefficient, to the lower-right position of the DCT block, which represents the highest-frequency AC coefficient. The exact order is represented in detail in Fig. 4.
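The following sketch (an illustration, not the JPEG reference implementation) applies the quantization steps of Table 1 to an 8 × 8 block of DCT coefficients and then reads the quantized block in zig-zag order, so that the small high-frequency values cluster at the end of the scan.

```python
import numpy as np

JPEG_LUMA_Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

def quantize(dct_block: np.ndarray, q: np.ndarray = JPEG_LUMA_Q) -> np.ndarray:
    """Uniform quantization: divide each coefficient by its step and round."""
    return np.round(dct_block / q).astype(int)

def zigzag_indices(n: int = 8):
    """(row, col) pairs in zig-zag order, from the DC coefficient to the
    highest-frequency AC coefficient."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    dct_block = rng.normal(scale=50, size=(8, 8))
    dct_block[0, 0] = 900.0                        # large DC component
    q = quantize(dct_block)
    scan = [q[r, c] for r, c in zigzag_indices()]
    print(scan)                                    # zeros cluster toward the end
```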


The zig-zag reordering is justified by a hypothesis on the knowledge of both natural images and HVS properties. In fact it is known that most of the energy in a natural image is concentrated in the DC components, whereas AC components are both less likely to occur and less important from a visual point of view. This is the reason why we can expect that most of the nonzero coefficients will be concentrated in the DC components. Ordering the coefficients in such a way that all zero or small coefficients are concentrated at the end improves the performance of entropy-based coding techniques by generating a distribution as far as possible from the uniform distribution. In Fig. 6 the results obtained by applying a JPEG-compliant DCT-based compression scheme on the Lena image are shown. The image is in a QCIF format (176 × 144 pixels) and is shown in both color and black and white format. Four different visual quality results are represented. Each corresponds to a different compression ratio as indicated. The higher the compression ratio, the worse the visual quality, as might be expected. Blocking artifacts are evident at very low bit rates, highly degrading the image quality.

D. Multiscale/Pyramidal Approaches

Multiscale and pyramidal coding techniques represent an alternative to the block quantization approach based on DCT (see Section III.C). Both approaches perform a transformation and a filtering of the image in order to compact the energy and improve the coding performances. However, multiscale/pyramidal coding techniques operate on the whole original image instead of operating on limited-dimension blocks. In particular, the image is filtered and subsampled in order to produce various levels of image detail at progressively smaller scales. An interesting property of this approach, when compared with the DCT, is the possibility for progressive transmission of the image, as will be described later. Moreover, the fact that no blocks are introduced avoids the generation of blocking artifacts, which represent one of the most annoying drawbacks of the DCT-based coding techniques. Multiresolution approaches have recently been of great interest to the video coding research community. From a complexity point of view, the approach of coding an image through successive approximation is often very efficient. From a theoretical point of view, it is possible to discover amazing similarities with the HVS models. In fact, experimental results have shown that the HVS uses a multiresolution approach (Schalkoff, 1989) in completing its tasks. Research suggests that multifrequency channel decomposition seems to take

FIGURE 6. Visual performances at different bit rates of a JPEG-compliant, DCT-based compression scheme. The top two images are the original images. The following represent the results obtained with different bit rates.


place in the human visual cortex, thereby causing the visual information to be processed separately according to its frequency band (Wandell, 1995). Similarly, the retina is decomposed into several frequency band sensors with uniform bandwidth on an octave scale. All of these considerations justify the keen interest shown by the researchers in this direction. In 1983, Burt and Adelson (1983) presented a coding technique based on the Laplacian pyramid. In this approach a lowpass filtering of the original image is performed as a first step. This is obtained by applying a weighted average function (Hn). Next a down-sampling of the image is performed. These two steps are repeated in order to produce progressively smaller images, in both spatial intensity and dimension. Altogether, the results of these transformations represent the Gaussian Pyramid, represented in Figure 7 by the three top images G1, G2, and G3. Each level in the Gaussian Pyramid, starting from the lowest, smallest level, is interpolated to the size of its predecessor in order to produce the Laplacian Pyramid. In terms of the coding, it is the Laplacian Pyramid, instead of the image itself, which is coded. As in the DCT-based coding method, the original image has been transformed to a specific structure in which each level of the pyramid has a different visual importance. The smallest level of the Gaussian Pyramid represents the roughest representation of the image. If greater quality is required, then successive levels of the Laplacian Pyramid need to be added. If the complete Laplacian Pyramid is available, a perfect reconstruction of the image is possible through the process of adding, with appropriate interpolation, all the different levels from the smallest resolution to the highest. This structure makes a progressive transmission particularly simple. As in the DCT-based approach, the real coding process is represented by the successive step: the quantization of each level of the pyramid. Again, a uniform quantization is the technique preferred by the authors. They achieve this by simply dividing the range of pixel values into bins of a set width: quantization then occurs by representing each pixel value that occurs within the bin by the bin centroid. Different compression ratios can be achieved by increasing or decreasing the amount of quantization. As before, there is a trade-off between high compression and visual quality. Burt and Adelson (1983) attempted to exploit areas in the image that are largely similar. These similar areas appear at various resolutions, hence, when the subsampled image is expanded and subtracted from the image at the next higher resolution, the difference image (Ln) contains large areas of zero, indicating commonality between the two images. These areas can be noticed in Fig. 7 as the dark zones in L1 and L2. The larger the degree of commonality, the greater the amount of zero areas in the difference


images. Standard first-generation coding methods can then be applied to the difference images to produce good compression ratios. With very good image quality, compression ratios of the order of 10:1 are achievable. Other techniques exist that are based on a similar pyramidal approach. Of particular interest in a second generation coding context are those based on mathematical morphology (Salembier and Kunt, 1992; Zhou and Venetsanopoulos, 1992; Toet, 1989). These techniques provide an analysis of the image based on object shapes and sizes; thus, they include features that are relevant to the HVS. The advantage of these techniques is that they do not suffer from ringing effects (cf. Section III.G) even under heavy quantization. However, these techniques produce residual images still with large entropy, thus not efficiently compressible through first generation coding schemes. Moreover, the residual images obtained are the same size as the original image. These drawbacks do not allow for a practical application of these techniques in image coding. A more detailed discussion on these techniques will be presented in Section III.G.
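A minimal sketch of the Gaussian/Laplacian pyramid construction is shown below. For simplicity it uses 2 × 2 averaging for the reduction step and pixel replication for the expansion step, instead of the weighted-average function Hn of Burt and Adelson, so it illustrates the structure of the method rather than reproducing their filters.

```python
import numpy as np

def reduce_(img: np.ndarray) -> np.ndarray:
    """REDUCE: lowpass by 2x2 block averaging, then downsample by two
    (a crude stand-in for the weighted-average filter Hn in the text)."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def expand(img: np.ndarray) -> np.ndarray:
    """EXPAND: upsample by two with pixel replication (a stand-in for the
    interpolation used when building and summing the Laplacian pyramid)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img: np.ndarray, levels: int = 3):
    gaussian = [img.astype(float)]
    for _ in range(levels - 1):
        gaussian.append(reduce_(gaussian[-1]))       # G1, G2, G3, ...
    laplacian = [g - expand(g_next)                  # Ln = Gn - EXPAND(Gn+1)
                 for g, g_next in zip(gaussian[:-1], gaussian[1:])]
    laplacian.append(gaussian[-1])                   # keep the coarsest level
    return laplacian

def reconstruct(laplacian):
    img = laplacian[-1]
    for diff in reversed(laplacian[:-1]):
        img = expand(img) + diff                     # add detail back, level by level
    return img

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    image = rng.integers(0, 256, size=(64, 64)).astype(float)
    pyr = laplacian_pyramid(image)
    print(np.allclose(reconstruct(pyr), image))      # True: lossless before quantization
```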

E. Wavelet-Based Approach

Although the wavelet transform-based coding approach is a generalization of multiscale/pyramidal approaches, it deserves to be treated separately. The enormous success it has obtained in the image coding research community and its particular compatibility with the second generation coding philosophy provide the rationale for a more extensive discussion of this category in this overview. Moreover, the future standard for still image compression, JPEG2000, will be based on the wavelet coding system. The wavelet transform represents the most commonly used transform in the current domain of research: subband coding (SBC). The idea is similar to that for the Gaussian Pyramid already described here, but much more general. Using the wavelet transform, instead of computing only a lowpass version of the original image, a complete set of subbands is computed by filtering the input image with a set of bandpass filters. In this way, each subband directly represents a particular frequency range of the image spectrum. A primary advantage of this transform is that it does not increase the number of samples over that of the original image, whereas pyramidal decompositions do. Moreover, wavelet-based techniques are able to efficiently conserve important perceptual information like edges, even if their energy contribution to the entire image is low. Other transform coders, like the one based on DCT, decompose images into representations where each coefficient corresponds to a fixed-size spatial area and frequency band. Edge


information would require many nonzero coefficients to represent it sufficiently. At low bit rates other transform coders allocate too many bits to signal behavior that is more localized in the time or space domain and not enough bits to edges. Wavelet techniques therefore offer benefits at low bit rates since information at all scales is available for edges and regions (Shapiro, 1993). Another important characteristic of wavelets in a second generation coding framework is that each subband can be coded separately from the others. This provides the possibility of allocating the total bit rate available according to the visual importance of each subband. Finally, SBC does not suffer from the annoying blocking artifacts reported in the DCT coders. However, it does suffer from an artifact specific to wavelet transforms, that of ringing. This effect occurs mainly around high-contrast edges and is due to the Gibbs phenomenon of linear filters. The effect of this phenomenon varies according to the specific filter bank used for the decomposition. Taking into account properties of the HVS, SBC and, in particular, wavelet transforms make it possible to achieve high compression ratios with very good visual quality images. Moreover, as with the pyramidal approach proposed by Burt and Adelson (1983), they permit a progressive transmission of the images by the hierarchical structure they possess. As a general approach, the concept of subband decomposition on which wavelets are based was originally introduced in the speech-coding domain by Crochiere et al. (1976) and Croisier et al. (1976). Later, Smith and Barnwell (1986) proposed a solution to the problem of perfect reconstruction for a 1D multirate system. In 1984, Vetterli (1984) extended perfect reconstruction filter banks theory to bidimensional signals. In 1986 Woods and O'Neil (1986) proposed the 2D separable quadrature mirror filter (QMF) banks that introduced this theory in the image-coding domain. The most currently used filter banks are the QMFs proposed by Johnston (1980). These are 2-band filter banks that are able to minimize a weighted sum of the reconstruction error and the stopband energy of each filter. Fig. 8 represents a generic scheme for 2-band filter banks. As they exhibit linear phase characteristics these filters are of particular interest to the research community; however they do not allow for perfect reconstruction. An alternative is represented by the conjugate quadrature filter (CQF) proposed by Smith and Barnwell (1986). These allow for perfect reconstruction, but do not have linear phase. M-band filters also exist as an alternative to quadrature filters (Vaidyanathan, 1987); however, the overhead introduced by their more complex design and computation has not helped the diffusion of these filters in the coding domain. Finally, some attempts to define filter banks that take further account of HVS properties have been pursued by Caglar et al. (1993) and Akansu et al. (1993).


FIGURE 8. Generic scheme for a 2-band analysis/synthesis system. H1 and G1 are, respectively, the analysis and synthesis lowpass filters. H2 and G2 represent the equivalent high-pass filters. Perfect reconstruction is achievable when Y(z) is a delayed version of X(z).

Among the complete set of subband filters that have been developed, an important place in second generation image coding is represented by the wavelet decomposition. This approach takes into consideration the fact that most of the power in natural images is concentrated in the low frequencies. Thus a finer partition of the frequency in the lowpass band is performed. This is achieved by a tree structured system as represented in Fig. 9. A wavelet decomposition is a hierarchical approach: at each level the available frequency band is decomposed into four subbands using a 2-band

FIGURE 9. A depth-2 wavelet decomposition. On the right, the 2-level tree structure is represented. X is the original image. LH1 represents a first-level subband, lowpass filtered in the horizontal direction and high-pass filtered in the vertical direction. The other filter definitions obey the same convention. Since each filtering is followed by a down-sampling step, the complete decomposition can be represented with as many coefficients as the size of the original image, as shown in the left-hand side of this figure.

FIGURE 10. An example of wavelet decomposition. The filter used to decompose the image "Lena" (512 × 512) is a 2-level biorthogonal Daubechies' 9/7 filter.

filter bank applied both to the lines and to the columns. This procedure is repeated until the energy contained in the lowest subband (LLn) is less than a prefixed threshold, determined according to an HVS model hypothesis. In Fig. 10 the results of applying wavelet decomposition on the test image Lena are represented. A generic scheme for a wavelet transform coder is represented in Fig. 11. Several different implementations reported in the literature differ according to the wavelet representation, the method of quantization, or the final entropic encoder.
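The tree-structured decomposition just described can be sketched as follows. Haar filters are used here purely for brevity, in place of the biorthogonal 9/7 pair of Fig. 10; what matters is the structure: a 2-band split along the rows and then the columns, repeated on the LL band.

```python
import numpy as np

def haar_1d(x: np.ndarray) -> np.ndarray:
    """One level of a 1D Haar analysis filter bank: lowpass averages and
    highpass differences, each followed by downsampling by two."""
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return np.concatenate([low, high])

def dwt2_level(img: np.ndarray) -> np.ndarray:
    """Apply the 2-band filter bank along the rows and then the columns,
    producing the LL, LH, HL and HH subbands in the four quadrants."""
    rows = np.apply_along_axis(haar_1d, 1, img)
    return np.apply_along_axis(haar_1d, 0, rows)

def dwt2(img: np.ndarray, levels: int = 2) -> np.ndarray:
    """Dyadic decomposition: recurse on the LL band only. Assumes a square
    image whose side is divisible by 2**levels."""
    out = img.astype(float).copy()
    size = img.shape[0]
    for _ in range(levels):
        out[:size, :size] = dwt2_level(out[:size, :size])
        size //= 2
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    image = rng.integers(0, 256, size=(64, 64)).astype(float)
    coeffs = dwt2(image, levels=2)
    print(coeffs.shape)              # same number of samples as the input image
```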

FIGURE 11. A generic scheme for wavelet encoders. First a wavelet representation of the image is generated (e.g., biorthogonal wavelets, wavelet packets, multiwavelets), then a quantization of the wavelet coefficients is performed (e.g., vector quantization). Finally an entropy encoder (e.g., run-length or arithmetic coding) is applied to generate the bit-stream.


Among the different families of wavelet representations, it is worth noting the compactly supported orthogonal wavelets. These wavelets belong to the more general family of orthogonal wavelets that generate orthonormal bases of L²(Rⁿ). The important feature of the compactly supported orthogonal wavelets is that in the discrete wavelet transform (DWT) domain they correspond to finite impulse response (FIR) filters. Thus they can be implemented efficiently (Mallat, 1989; Daubechies, 1993, 1998). In this family the Daubechies' wavelets and Coifman's wavelets are popular. An important drawback of compactly supported orthogonal wavelets is their asymmetry. This is responsible for the generation of artifacts at the borders of the wavelet subbands as reported by Mallat (1989). To avoid this drawback, he has also investigated noncompact orthogonal wavelets; however they do not represent an efficient alternative due to their complex implementation. An alternative wavelet family that presents symmetry properties is the biorthogonal wavelet representation. This wavelet representation also offers efficient implementations and thus it has been adopted in several wavelet image coders. The example represented in Figure 10 was generated using a wavelet belonging to this family. There has been some work carried out in an attempt to define methods that identify the best wavelet basis for a particular image. In this framework a generalized family of multiresolution orthogonal or biorthogonal bases that includes wavelets has been introduced; these are regrouped according to Lu et al. (1996) in the wavelet packets family. Different authors have proposed entropic or rate-distortion based criteria to choose the best basis from this wide family (Coifman and Wickerhauser, 1992; Ramchandran and Vetterli, 1993). In a second generation image coding framework, of particular interest is the research carried out on the zero-crossings and local maxima of wavelet transforms (Mallat, 1991; Froment and Mallat, 1992). These techniques directly introduce in the wavelet framework the concept of edges and contours, so important in the HVS (Croft and Robinson, 1994; Mallat and Zhong, 1991). More detail on this approach will be given in Section III.F. The choice of the wavelet to be used is indeed a key issue in designing a wavelet image coder. The preceding short discussion shows that many different choices are available: not all directly take into account HVS considerations. These can, however, be introduced in the subsequent quantization step of the coding process. As was discussed, a wavelet representation generates, for each image, a number of 3D + 1 subbands, where D represents the levels of the decomposition (dyadic scales). Each subband shows different statistical behaviors; thus it is important to apply an optimized quantization for each of them.


As already discussed in Section III.C for the DCT transform, and as reported by Jain (1989), the uniform quantizer is a quasi-optimum solution for an MSE criterion. In this case we simply need to define for each subband a quantizer step. Note that this solution is similar to the one used in the JPEG standard, where each coefficient in the DCT block is associated with a different quantization step (see Table 1). This choice is the one used in practice by a well-known software package, EPIC (Simoncelli and Adelson). In this case an initial step size is defined and divided by a factor of two as one goes to the next coarser step in the wavelet decomposition. Thus the lowest subband, which provides most of the visual information, is finely quantized with the smallest step size. Other methods increase the compression by mapping small coefficients in the highest frequency bands to zero. Research has also been performed aimed at the design of HVS-based quantizers. In particular, Lewis and Knowles (1992) designed a quantizer that considers the HVS's spectral response, noise sensitivity in background luminance, and texture masking. For scalar quantization, the uniform quantization performs well; other alternatives are represented by the vector quantization (VQ) methods. Generally VQ performs better than SQ, as discussed in Senoo and Girod (1992). The principle is to quantize vectors or blocks of coefficients instead of the coefficient itself. This generalization of the SQ takes into account the possible correlation between coefficients, already at the quantization step. Cicconi et al. (1994) describe a Pyramidal Vector Quantization that takes into account correlation between subbands that belong to the same frequency orientations. Thus both intra- and interband correlation are taken into account during the quantization process. In the same contribution the authors also introduce a criterion for a perceptual quantization of the coefficients, which is particularly suited to second generation image coding techniques. Another possible solution in wavelet coders is represented by a successive-approximation quantization. In this category, it is important to cite the method proposed by Shapiro (1993): the "embedded zerotree wavelet algorithm" (EZW). This method tries to predict the absence of significant information across the subbands generated by the wavelet decomposition. This is achieved by defining a zerotree structure. Starting from the lowest frequency subband, a father-children relationship is defined recursively through all the following subbands as represented in Fig. 12. Basically the quantization is performed by successive approximation across the subbands with the same orientation. Similar to the zig-zag scanning reported in Section III.C, a scanning of the different subbands as shown in Fig. 13 is performed. This strategy turns out to be an efficient technique to code zero and nonzero quantized values.
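The father–children relationship underlying the zerotree can be made explicit with a small helper that maps a coefficient position to its four descendants at the next finer scale. The indexing below assumes the usual flattened dyadic layout (coarse subbands in the top-left quadrant) and ignores the special parenting rules for the coarsest LL band, so it is a simplified illustration of the EZW structure rather than a faithful implementation.

```python
def zerotree_children(r: int, c: int, size: int):
    """Children of the coefficient at (r, c) in a dyadic wavelet layout of
    side `size`: the four same-orientation coefficients one scale finer.
    Coefficients in the finest subbands (outside the top-left half of the
    array) have no children. The coarsest LL band is not treated specially
    here, which real EZW coders do."""
    if r >= size // 2 or c >= size // 2:
        return []                                   # already at the finest scale
    return [(2 * r, 2 * c), (2 * r, 2 * c + 1),
            (2 * r + 1, 2 * c), (2 * r + 1, 2 * c + 1)]

def descendants(r: int, c: int, size: int):
    """All descendants of (r, c): the set EZW tests against the current
    threshold to decide whether a zerotree root can be coded."""
    stack, found = [(r, c)], []
    while stack:
        node = stack.pop()
        kids = zerotree_children(node[0], node[1], size)
        found.extend(kids)
        stack.extend(kids)
    return found

if __name__ == "__main__":
    print(zerotree_children(2, 3, size=64))   # four children one scale finer
    print(len(descendants(2, 3, size=64)))    # whole tree below that coefficient
```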


FIGURE 13. Zero-tree scanning order for a 3-scale QMF wavelet decomposition.


Research by both Said and Pearlman (1996) and Taubman and Zakhor (1994), based on the same principle developed by Shapiro, provided even better coding performances. Their new techniques are known as set partitioning in hierarchical trees (SPIHT) (Said and Pearlman, 1996) and layered zero coefficient (LZC) (Taubman and Zakhor, 1994). Recently, new efforts have been devoted to the improvement of these coding techniques with special attention to both HVS properties and color components (Lai and Kuo, 1998a,b; Nadenau and Reichel, 1999). An interesting example is represented by the technique proposed by Nadenau and Reichel (1999). This technique is based on an efficient implementation of the LZC method (Taubman and Zakhor, 1994). It applies the lifting steps approach, presented by Daubechies and Sweldens (1998), in order to reduce the memory and the number of operations required to perform the wavelet decomposition. It also performs a progressive coding based on the HVS model and includes color effects. The HVS model is used to predict the best possible bit allocations during the quantization step. In particular, the color image is converted into the opponent color space discussed by Poirson and Wandell (1993, 1996): this representation reflects, better than the usual YCbCr representation, the properties of color perception in the HVS model. Finally, this technique produces a visually embedded bit-stream. This means not only that the quality improves as more bytes are received and that the transmission can be stopped at any time, but also that the partial results are always coded with the best visual quality.
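The lifting-steps idea mentioned above can be illustrated with the simplest possible case, the Haar transform written as one predict step and one update step; the 9/7 filter actually used by such coders requires several lifting steps with different coefficients, so this is only a structural sketch.

```python
import numpy as np

def haar_lifting_forward(x: np.ndarray):
    """Haar transform via lifting: split into even/odd samples, predict the
    odd samples from the even ones, then update the even samples so that
    they keep the pairwise mean. The in-place character of these steps is
    what gives lifting its low memory cost."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even          # predict step: detail = odd - prediction
    approx = even + detail / 2   # update step: approximation keeps the mean
    return approx, detail

def haar_lifting_inverse(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    even = approx - detail / 2   # undo the update
    odd = detail + even          # undo the prediction
    out = np.empty(even.size + odd.size)
    out[0::2], out[1::2] = even, odd
    return out

if __name__ == "__main__":
    x = np.array([9.0, 7.0, 3.0, 5.0, 6.0, 10.0, 2.0, 4.0])
    a, d = haar_lifting_forward(x)
    print(a, d)
    print(np.allclose(haar_lifting_inverse(a, d), x))   # True: perfect reconstruction
```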

F. Edge Detection

Mallat and Zhong (1991) point out that in most cases structural information required for recognition tasks is provided by the image edges. However, one major difficulty of edge-based representation is to integrate all the image information into edges. Most edge detectors are based on local measurements of the image variations and edges are generally defined as points where the image intensity has a maximum variation. Multiscale edge detection is a technique in which the image is smoothed at various scales and edge points are detected by a first- or second-order differential operator. The coding method presented involves two steps. First, the edge points considered important for visual quality are selected. Second, these are efficiently encoded. Edge points are chained together to form edge curves. Selection of the edge points is performed at a scale of 2². This means that the edge points are selected from the image in the pyramidal structure that has been scaled by a factor of four. Boundaries of important structures often


generate long edge curves, so, as a first step, all edge curves whose lengths are smaller than a threshold are removed. Among the remaining curves, the ones that correspond to the sharpest discontinuities in the image are selected. This is achieved by removing all edge curves along which the average value of the wavelet transform modulus is smaller than a given amplitude threshold. After the removal procedures, it is reported that only 8% of the original edge points are retained; however, it is not clear if this figure is constant for all images. Once the selection has been performed, only the edge curves at scale 2² are coded in order to save bits; the curves at other scales are approximated from this. Chain coding is used to encode the edge curve at this scale, as sketched below. The compression ratio reported by Mallat and Zhong (1991) with this method is approximately 27:1 with good image quality.
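Chain coding, named above as the method used to encode the retained edge curves, can be sketched as follows; the eight-direction Freeman convention is assumed here and is not necessarily the exact code used by Mallat and Zhong.

```python
# Freeman 8-direction codes: index k encodes the step (dr, dc) below.
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_code(points):
    """Encode an ordered list of 8-connected edge points as a start point
    plus one 3-bit direction symbol per step."""
    codes = []
    for (r0, c0), (r1, c1) in zip(points[:-1], points[1:]):
        codes.append(DIRECTIONS.index((r1 - r0, c1 - c0)))
    return points[0], codes

def decode_chain(start, codes):
    points = [start]
    for k in codes:
        dr, dc = DIRECTIONS[k]
        points.append((points[-1][0] + dr, points[-1][1] + dc))
    return points

if __name__ == "__main__":
    curve = [(10, 10), (10, 11), (9, 12), (9, 13), (10, 14)]
    start, codes = chain_code(curve)
    print(start, codes)                         # (10, 10) [0, 1, 0, 7]
    print(decode_chain(start, codes) == curve)  # True
```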

G. Directional Filtering

Directional filtering is based on the relationship between the presence of an edge in an image and its contribution to the image spectrum. It is motivated by the existence of direction-sensitive neurons in the HVS (Kunt et al., 1985; Ikonomopoulos and Kunt, 1985). It can be seen that the contribution of an edge is distributed all over the spectrum; however, the highest frequency component lies in the direction orthogonal to that of the edge. It can also be seen that the frequency of the contribution diminishes as we turn away from this direction, until it vanishes at right angles to it. A directional filter is one whose frequency response covers a sector or part of a sector in the frequency domain. If f and g are spatial frequencies and $r_c$ is the cut-off frequency of the lowpass filter, then the ideal frequency response of the ith directional filter of a set of n is given by

$$G_{1i}(f, g) = \begin{cases} 1, & \text{if } \theta_i \leq \tan^{-1}(g/f) < \theta_{i+1} \\ 0, & \text{otherwise} \end{cases}$$

with

$$\theta_i = (i - 1)\,\frac{\pi}{2n}, \qquad \theta_{i+1} = (i + 1)\,\frac{\pi}{2n}$$

and $|f|, |g| \leq 0.5$. A directional filter is a high-pass filter along its principal direction and a lowpass filter along the orthogonal direction. The directional filter response is modified, as in all filter design, by an appropriate window function


(Harris, 1978), to minimize the effect of the Gibbs phenomenon (Ziemer et al., 1989). In the sum of a trigonometric series, it can be seen that there tend to be overshoots in the signal being approximated at a discontinuity. This is referred to as the Gibbs phenomenon. An ideal filter can be viewed as a step or rectangular pulse waveform, that is, a discontinuous waveform. The reason for the overshoot at discontinuities can be explained using the Fourier transform. Consider a signal x(t) with a Fourier transform X(f). Reconstructing x(t) from its lowpass part, limited to |f| < W, gives

x̂(t) = x(t) * F⁻¹{Π(f/2W)} = x(t) * (2W sinc 2Wt),

where Π(f/2W) = 1 for |f| < W and 0 otherwise, * denotes convolution, and the second equality follows from the convolution theorem of Fourier transform theory.

Bearing in mind that convolution is a folding-product, sliding-integration process, it can be seen that a finite value of W will always result in x(t) being viewed through the sinc window function; even as W increases, more of the frequency content of the rectangular pulse will be used in the approximation of x(t). In order to eliminate the Gibbs phenomenon it is important to modify the frequency response of the filter by a window function. There are many window functions available, each with different frequency responses. The frequency response of the chosen window function is convolved with the filter response. This ensures that the overall frequency response does not contain the sharp discontinuities that cause the ripple. In a general scheme using directional filters, n directional filters and one lowpass filter are required. An ideal lowpass filter has the following frequency response:

G_0(f, g) = 1, if f² + g² < r_c²
            0, otherwise.
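The sketch below builds such a filter bank directly in the 2-D DFT domain: one circular lowpass mask plus n angular sector masks that together tile the frequency plane, so summing the filtered images returns the original image. For simplicity the sector boundaries are taken at multiples of π/n and the cut-off radius is an arbitrary illustrative value; this is a toy construction of ideal (unwindowed) filters, not the windowed design discussed in the text.

```python
import numpy as np

def directional_filter_bank(image, n=8, r_cut=0.05):
    """Decompose an image into one lowpass image and n ideal directional images
    by masking angular sectors of its centred 2-D DFT."""
    rows, cols = image.shape
    F = np.fft.fftshift(np.fft.fft2(image))
    # Normalised frequency coordinates f (vertical) and g (horizontal) in [-0.5, 0.5).
    f = np.fft.fftshift(np.fft.fftfreq(rows))[:, None]
    g = np.fft.fftshift(np.fft.fftfreq(cols))[None, :]
    lowpass_mask = (f ** 2 + g ** 2) < r_cut ** 2
    theta = np.mod(np.arctan2(g, f), np.pi)          # orientation folded to [0, pi)
    outputs = [np.real(np.fft.ifft2(np.fft.ifftshift(F * lowpass_mask)))]
    edges = np.linspace(0.0, np.pi, n + 1)           # n equal angular sectors
    for i in range(n):
        sector = (theta >= edges[i]) & (theta < edges[i + 1]) & ~lowpass_mask
        outputs.append(np.real(np.fft.ifft2(np.fft.ifftshift(F * sector))))
    return outputs  # [lowpass, directional_1, ..., directional_n]

if __name__ == "__main__":
    img = np.random.rand(64, 64)
    parts = directional_filter_bank(img)
    assert np.allclose(sum(parts), img, atol=1e-8)   # superposition reconstructs the image
```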

It should be noted that the superposition of all the directional images and the lowpass image leads to an exact reconstruction of the original image. Two parameters are involved in the design of a directional filter-based image coding scheme: the number of filters and the cutoff frequency of the lowpass filter. The number of filters may be set a priori and is directly related to the


minimum width of the edge elements. The choice of lowpass cutoff frequency influences the compression ratio and the quality of the decoded image. As reported by Kunt et al. (1985), a very early technique, in advance of its time, was the synthetic highs system (Schreiber et al., 1959; Schreiber, 1963). It is stated by Kunt that the better-known approach of directional filtering is a refinement of the synthetic highs system. In this technique, the original image is split into two parts: the lowpass picture showing general area brightness and the high-pass image containing edge information. Two-dimensional sampling theory suggests that the lowpass image can be represented with very few samples. In order to reduce the amount of information in the high-pass image, thresholding is performed to determine which edge points are important. Once found, the location and magnitude of each edge point is stored. To reconstruct the compressed data, a 2D reconstruction filter, whose properties are determined by the lowpass filter used to produce the lowpass image, is used to synthesize the high frequency part of the edge information. This synthesized image is then added to the lowpass image to give the final output. Ikonomopoulos and Kunt (1985) describe their technique for image coding based on this refinement of the synthetic highs system: directional filtering. Once the image has been filtered, the result is one lowpass image and 16 directional images. The coding scheme proposed is lossy since high compression is the goal. When the image is filtered with a high-pass filter, the result exhibits zero-crossings at the locations of abrupt changes (edges) in the image. Each directional component is represented by the locations and magnitudes of its zero-crossings. Given that a small number of points result from this process, typically 6-10% of the total number of points, run-length encoding proves efficient for this purpose. The low frequency component can be coded in two ways. As the maximum frequency of this component is small, it can be resampled based on the 2D sampling theorem and the resulting pixels can be coded in a standard way. Alternatively, transform coding may be used, with the choice of transform technique being controlled by the filtering procedure used. The transform coefficients may then be quantized and coded via Huffman coding (Huffman, 1952). The compression ratios obtained with this technique depend on many factors. The image being coded and the choice of cutoff frequency both play an important role in the final ratio obtained. The compression scheme can be adapted to the type of image being compressed. Zhou and Venetsanopoulos (1992) present an alternative spatial method called morphological directional coding. In their approach, spatial image features at known resolutions are decomposed using a multiresolution morphological technique referred to as the feature-width morphological pyramid (FMP). Zhou and Venetsanopoulos (1992) report that nontrivial


spatial features, such as edges, lines, and contours within the image, determine the quality of the reproduced image for the human observer. It was this fact that motivated them to employ a stage in their coding technique that identifies these nontrivial features in order that they may be coded separately. Morphological directional coding schemes were developed to preserve nontrivial spatial features in the image during the coding phase. Such filtering techniques are used for feature separation, as they are spatial methods that are capable of selectively processing features of known geometrical shapes. A multiresolution morphological technique therefore decomposes image features at various resolutions. In this technique, image decomposition is a multistage process involving a filter called an open-closing (OC) filter. Each filtered image from the current stage is used as the input to the next stage, and in addition the difference between the input and output images of each stage is calculated. The first N − 1 decomposed subimages (L1, ..., LN−1) are termed feature images, and each contains image features at known resolutions. For example, L1 contains image features of width 1, L2 has features of width 2, and so on. Each OC filter has a structuring element associated with it, with those for stage n progressively larger than for the previous stage n − 1. The structuring element defines the information content in each of the decomposed images. The decomposed FMP images contain spatial features in arbitrary directions. Therefore, directional decomposition filtering techniques are applied to each of the FMP images in order to group features of the same direction together. Before this is implemented, the features in the FMP images, L2, ..., LN−1, must be eroded to 1-pixel width. There are two reasons for this feature-thinning phase (Zhou and Venetsanopoulos, 1992). First, the directional decomposition filter bank gives better results for features of 1-pixel width and, second, it is more efficient and simpler to encode features of 1-pixel width. After the FMP images have been directionally decomposed, the features are further quantized by a nonuniform scalar quantizer. Each extracted feature is first encoded with a vector and then each vector is entropy encoded. The coarse image LN is encoded using conventional methods such as VQ. Both of these methods employ directional decomposition as the basis of their technique. Ikonomopoulos and Kunt (1985) implemented a more traditional approach in that the directional decomposition filters are applied directly to the image. In their method the compression ratio varies from image to image. The filter design depends on many factors, which in turn affect the compression ratio. Therefore Ikonomopoulos and Kunt (1985) state that these parameters should be tuned to the particular image because the quantity, content, and structure of the edges in the image determine the


compression obtained. Despite these factors, compression in the order of 64:1 is reported with good image quality. The morphological filtering technique by Zhou and Venetsanopoulos (1992) separates the features into what they refer to as FMP images. Traditional directional decomposition techniques are applied to these FMP images in order to perform the coding process. The compression ratios reported by this method are reasonable at around 20:1.

IV. SEGMENTATION-BASED APPROACHES

A. Overview

A general scheme of a segmentation-based image coding approach is represented in Fig. 14. The original image is first preprocessed in order to eliminate noise and small details. The segmentation is then performed in order to organize the image as a set of regions. These might represent the objects in the scene or, more generally, some homogeneous groups of pixels. Once the regions have been generated, the coding step takes place. This is composed of two different procedures: contour coding and texture coding. The former is responsible for coding the shape of each region so that it can be reconstructed later at the decoder site; the latter is responsible for coding the texture inside each region. These two procedures generate two bit-streams that, together, are used for the reconstruction of the original image. The segmentation-based approaches have strong motivations in the framework of second-generation image coding. The visual data to be coded are generally more coherent inside a semantically meaningful region than inside predefined blocks. The introduction of a semantic representation of the scene might increase the decorrelation of the data, thus providing higher energy compaction and consequently improved compression performance.

FIGURE 14. Generic scheme for the segmentation-based approach to image coding.


Moreover, an object representation of the scene is the key point for dynamic coding (Ebrahimi et al., 1995). Each region can be coded independently from the others; this means that the coding approach that best suits the statistics of each single region can be applied. The introduction of a semantic representation of the image has another advantage: that of object interaction. This concept is particularly suitable for video sequences, and is one of the key points of the new MPEG-4 standard, but it can also be extended to still image coding. As we have mentioned, the HVS is able to recognize objects and automatically assign a different priority to an object of high or low interest. This can be simulated in a segmentation-based approach by assigning high bandwidth to visually or semantically important regions and low bandwidth to less crucial objects. Research on predicting and dynamically allocating the available bit rate has been performed by Fleury et al. (1996) and Fleury and Egger (1997). Despite these advantages, segmentation-based coding approaches suffer from some major drawbacks. First, the segmentation process is computationally expensive and generally not very accurate or automatic. Thus, it is still not possible to correctly analyze, in real time, the semantic content of a generic image. This is a severe limitation for practical applications. Second, for each region we want to compress, it is necessary to code not only the texture information but also the contour information. This introduces an overhead that might even outweigh the advantages obtained by coding a more coherent region. Finally, it has been shown that a semantic representation of the scene does not always provide homogeneous regions suitable for high compression purposes. In the next section, a brief review of important preprocessing techniques will be outlined. In Section IV.C, an overview of existing segmentation techniques will be proposed. A discussion of texture and contour coding will be presented in Sections IV.D and E. Finally, a review of major coding techniques based on a segmentation approach will be presented in the last four sections, IV.F, G, H, and I.

B. Preprocessing

The purpose of preprocessing is to eliminate small regions within the image and remove noise generated in the sampling process. It is an attempt to model the action of the HVS and is intended to alter the image in such a way that the preprocessed image resembles more closely what the human brain actually processes. There are various methods used to preprocess the image,


all derived from properties of the HVS. Two properties commonly used are Weber's Law and the modulation transfer function (MTF) (Jang and Rajala, 1990, 1991; Civanlar et al., 1986). Marqués et al. (1991) suggest the use of Stevens' Law. This accounts for a greater sensitivity of the HVS to gradients in dark areas as compared to light ones. For example, if B is the perceived brightness and I the stimulus intensity, then:

B = K · I^γ.

Therefore, by preprocessing according to Stevens' Law, visually homogeneous regions will not be split unnecessarily and heterogeneous dark areas will not be falsely merged. In addition, the inverse gradient filter (Wang and Vagnucci, 1981) has also been implemented in order to give a lowpass response inside a region and an all-pass response on the region's contour (Kwon and Chellappa, 1993; Kocher and Kunt, 1986). This is an iterative scheme that employs a 3 x 3 mask of weighting coefficients. These coefficients are the normalized inverse gradients between the central pixel and its neighbors. If the image to be smoothed is expressed as an n x m array whose coefficients p(i, j) are the gray levels of the image pixels at (i, j), with i = 1, ..., n and j = 1, ..., m, the inverse of the absolute gradient at (i, j) is then defined as

δ(i, j : k, l) = 1 / |p(i + k, j + l) − p(i, j)|,

where k, l = −1, 0, 1 but k and l are not both equal to zero. This means that the δ(i, j : k, l) are calculated for the eight neighbors of (i, j); this set of neighbors is denoted the vicinity V(i, j). If p(i + k, j + l) = p(i, j), then the gradient is zero and δ(i, j : k, l) is defined as 2. The proposed 3 x 3 smoothing mask is defined as:

W(i, j) = [ w(i−1, j−1)   w(i−1, j)   w(i−1, j+1)
            w(i, j−1)     w(i, j)     w(i, j+1)
            w(i+1, j−1)   w(i+1, j)   w(i+1, j+1) ]

where w(i, j) = 1/2 and, for k, l = −1, 0, 1 not both zero,

w(i + k, j + l) = (1/2) δ(i, j : k, l) / Σ_{(k,l) ∈ V(i,j)} δ(i, j : k, l).

The smoothed image is then given as

p̂(i, j) = Σ_{k=−1}^{+1} Σ_{l=−1}^{+1} w(i + k, j + l) p(i + k, j + l).
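One iteration of the inverse gradient filter defined above can be sketched directly from these equations; the replication of border pixels is our own implementation choice.

```python
import numpy as np

def inverse_gradient_smooth(img):
    """One iteration of the Wang-Vagnucci inverse gradient filter:
    lowpass inside regions, roughly all-pass across region contours."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    out = np.empty_like(img)
    offsets = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1) if (k, l) != (0, 0)]
    for i in range(rows):
        for j in range(cols):
            centre = padded[i + 1, j + 1]
            deltas = []
            for k, l in offsets:
                diff = abs(padded[i + 1 + k, j + 1 + l] - centre)
                deltas.append(2.0 if diff == 0 else 1.0 / diff)  # delta = 2 when the gradient is zero
            total = sum(deltas)
            # Centre weight 1/2; neighbour weights (1/2) * delta / sum(delta).
            value = 0.5 * centre
            for (k, l), d in zip(offsets, deltas):
                value += 0.5 * (d / total) * padded[i + 1 + k, j + 1 + l]
            out[i, j] = value
    return out
```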

Finally, anisotropic diffusion filtering (Perona and Malik, 1990; Yon et al., 1996; Szirányi et al., 1998) is worth citing as a preprocessing method


FIGURE 15. Example of anisotropic diffusion applied to a natural image. On the left-hand side, the original image is displayed; on the right-hand side, the filtered version is represented.

because it is effective in smoothing image details while preserving edge information, as shown in Fig. 15.
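A compact sketch of Perona-Malik anisotropic diffusion used as such a preprocessing step is given below: the conduction coefficient decays with the local gradient magnitude, so smoothing is strong inside homogeneous areas and weak across edges. The exponential conduction function, the step size, and the number of iterations are illustrative choices.

```python
import numpy as np

def anisotropic_diffusion(img, iterations=20, kappa=20.0, step=0.2):
    """Perona-Malik diffusion: strong smoothing inside homogeneous regions,
    little smoothing across high-gradient edges."""
    u = np.asarray(img, dtype=float).copy()
    for _ in range(iterations):
        p = np.pad(u, 1, mode="edge")
        # Differences towards the four nearest neighbours.
        dn = p[:-2, 1:-1] - u
        ds = p[2:, 1:-1] - u
        de = p[1:-1, 2:] - u
        dw = p[1:-1, :-2] - u
        # Conduction coefficients: small where the gradient is large (edges).
        cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
        ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
        u = u + step * (cn * dn + cs * ds + ce * de + cw * dw)
    return u
```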

C. Segmentation Techniques: Brief Overview

This section introduces some of the commonly used methods for segmenting an image. Segmentation groups similar pixels into regions and separates those pixels that are considered dissimilar. It may be thought of as representing an image by a disjoint covering set of image regions (Biggar et al., 1988). Many segmentation methods have been developed in the past (Pal and Pal, 1993; Haralick, 1983), and it is generally the segmentation method that categorizes the coding technique. Most image segmentation techniques today are applied to video sequences. Thus, they have access to motion information, which is extremely useful in improving their performance. We have focused here on still image coding, but motion information remains an important HVS feature. Thus, in the following we will also refer to those techniques that integrate both spatial and temporal information to achieve a better segmentation of the image.

1. Region Growing

Region growing is a process that subdivides a (filtered) image into a set of adjacent regions whose gray-level variation within each region does not exceed a given threshold. The basic idea behind region growing is that, given a starting point within the image, the largest set of connected pixels whose gray levels lie within a specified interval is found. This interval is adaptive in that it is allowed to move higher or lower on the grayscale in order to capture the maximum number of pixels. Figure 16 illustrates the concept of region growing for two contrasting images.


FIGURE 16. Images a) and b) are, respectively, the original test images "Table Tennis" and "Akiyo." Images c) and d) are the corresponding segmentations obtained through region growing.
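A minimal sketch of the region-growing idea just described: starting from a seed pixel, 4-connected neighbours are added while their gray level stays within a tolerance of the continuously updated region mean. The fixed tolerance and the 4-connectivity are simplifications of the adaptive interval mentioned above.

```python
import numpy as np
from collections import deque

def grow_region(img, seed, tol=10.0):
    """Grow a region from `seed` (row, col): add 4-connected neighbours whose
    gray level differs from the current region mean by less than `tol`.
    Returns a boolean mask of the region."""
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    mask = np.zeros((rows, cols), dtype=bool)
    mask[seed] = True
    total, count = img[seed], 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr, nc]:
                if abs(img[nr, nc] - total / count) < tol:
                    mask[nr, nc] = True
                    total += img[nr, nc]
                    count += 1
                    queue.append((nr, nc))
    return mask
```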

2. Split and Merge

Split-and-merge algorithms (Pavlidis, 1982) segment the image into sets of homogeneous regions. In general, they are based on the quadtree (Samet, 1989) data structure. Initially, the image is divided into a predefined subdivision; then, depending on the segmentation criteria, adjacent regions are merged if they have similar gray-level variations, or a quadrant is further split if large variations exist. An example of this method is displayed in Fig. 17.

FIGURE 17. An example of quadtree decomposition of the image "Table Tennis." The initial decomposition into square blocks is iteratively refined through successive split and merge steps.
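The split phase of such an algorithm can be sketched as a recursive quadtree decomposition in which a block is subdivided while its gray-level range exceeds a homogeneity threshold; the merging of similar adjacent leaves is omitted here for brevity, and the threshold, minimum block size, and square power-of-two image are illustrative assumptions.

```python
import numpy as np

def quadtree_split(img, threshold=10.0, min_size=4):
    """Recursively split the image into square blocks whose gray-level range
    (max - min) is below `threshold`. Returns a list of (row, col, size) leaves."""
    img = np.asarray(img, dtype=float)
    assert img.shape[0] == img.shape[1], "square, power-of-two image assumed"
    leaves = []

    def split(r, c, size):
        block = img[r:r + size, c:c + size]
        if size <= min_size or block.max() - block.min() <= threshold:
            leaves.append((r, c, size))   # homogeneous enough: keep as a leaf
            return
        half = size // 2
        for dr in (0, half):
            for dc in (0, half):
                split(r + dr, c + dc, half)

    split(0, 0, img.shape[0])
    return leaves
```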


FIGURE 18. Segmentation results obtained by applying the method proposed by Ziliani (1998).

3. K-Means Clustering

K-means clustering is a segmentation method based on the minimization of the sum of squared distances from all points in a cluster to the cluster center. First, k initial cluster centers are chosen and the image vectors are iteratively distributed among the k cluster domains. New cluster centers are then computed such that the sum of the squared distances from all points in a cluster to the new cluster center is minimized. It is interesting to note that this method can characterize each cluster and each pixel of the image with several features, including luminance, color, texture, etc., as described in Castagno (1998) and Ziliani (1998). In Fig. 18, an example of the segmentation obtained by applying the method proposed by Ziliani (1998) is presented.
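A bare-bones sketch of this clustering on a single feature (luminance) is given below; in practice each pixel would carry a feature vector (luminance, color, texture, position), as noted above. The random initialization and fixed iteration count are illustrative choices.

```python
import numpy as np

def kmeans_segment(img, k=3, iterations=20, seed=0):
    """Segment an image by k-means clustering of pixel gray levels.
    Returns an integer label image."""
    rng = np.random.default_rng(seed)
    pixels = np.asarray(img, dtype=float).ravel()
    centres = rng.choice(pixels, size=k, replace=False)
    for _ in range(iterations):
        # Assign each pixel to the nearest cluster centre.
        labels = np.argmin(np.abs(pixels[:, None] - centres[None, :]), axis=1)
        # Recompute each centre as the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = pixels[labels == j].mean()
    return labels.reshape(np.asarray(img).shape)
```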

4. Pyramidal Linking

This method, proposed by Burt et al. (1981), uses a pyramid structure in which flexible links between the nodes of adjacent layers are established. The base of the pyramid is the original image. The layers consist of nodes that comprise the feature values and other information, as described by Ziliani and Jensen (1998). The initial value for a node of a layer is obtained by computing the mean of a certain area in the layer below. This is done for all nodes in such a way that they correspond to partially overlapping regions. After this is done for the entire pyramid, father-son relationships are defined between the current layer and the layer below using those nodes that participated in the initial feature computation. Using these links, the feature values of all layers are updated again and afterwards new links are established. This is repeated until a stable state is reached. In Fig. 19, an example of the segmentation obtained by applying the method proposed in Ziliani and Jensen (1998) is represented.


FIGURE 19. These are the regions obtained by applying the pyramid linking segmentation proposed by Ziliani and Jensen (1998) to "Table Tennis."

5. Graph Theory

There are a number of image segmentation techniques that are based on the theory of graphs and its applications (Morris et al., 1986). A graph is composed of a set of "vertices" connected to each other by "links." In a weighted graph, the vertices and links have weights associated with them. Each vertex need not necessarily be linked to every other, but if every pair is, the graph is said to be complete. A partial graph has the same number of vertices but only a subset of the links of the original graph. A "spanning tree" is a partial graph that is a tree containing all the vertices. A "shortest spanning tree" of a weighted graph is a spanning tree such that the sum of its link weights is a minimum over all possible spanning trees. To analyze images using graph theory, the original image must be mapped onto a graph. The most obvious way to do this is to map every pixel in the original image onto a vertex in the graph. Other techniques generate a first over-segmentation of the image and map each region instead of each pixel. This reduces complexity and improves segmentation results because each node of the graph is already a coherent structure. Recently, Moscheni et al. (1998) have proposed an effective segmentation technique based on graphs.

6. Fractal Dimension

The fractal dimension D is a characteristic of the fractal model (Mandelbrot, 1982), which is related to properties such as the length and surface of a curve. It provides a good measure of the perceived roughness of the image surface. Therefore, in order to segment the image, the fractal dimension is computed across the entire image. Various threshold values can then be used to segment the original image according to its fractal dimension.
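A simple differential box-counting estimate of D for a gray-level image treated as a surface can be sketched as follows; the chosen box sizes and the least-squares fit are illustrative, and more careful estimators exist.

```python
import numpy as np

def fractal_dimension(img, box_sizes=(2, 4, 8, 16)):
    """Differential box-counting estimate of the fractal dimension of a
    gray-level image viewed as a 3-D surface (x, y, intensity)."""
    img = np.asarray(img, dtype=float)
    gmax = img.max() if img.max() > 0 else 1.0
    counts = []
    for s in box_sizes:
        n_boxes = 0
        box_height = gmax * s / img.shape[0]   # intensity box height scales with s
        for r in range(0, img.shape[0] - s + 1, s):
            for c in range(0, img.shape[1] - s + 1, s):
                block = img[r:r + s, c:c + s]
                # Number of intensity boxes of height box_height spanned by the block.
                n_boxes += int(np.ceil((block.max() - block.min()) / box_height)) + 1
        counts.append(n_boxes)
    # D is the slope of log(count) versus log(1/s).
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return slope
```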


D. Texture Coding

According to the scheme presented in Fig. 14, once the segmentation of the image has been performed, it is necessary to code each defined region. Under the hypothesis that the segmentation has generated regions of homogeneous luminance, a first approach to coding their texture is the polynomial approximation presented in Section IV.D.1. However, we have already noted that the need for a semantic representation of the scene, which might be useful for dynamic coding applications, does not always correspond to the definition of homogeneous regions. In these cases, a more general approach such as the shape-adaptive DCT transform (Section IV.D.2) is used.

1. Polynomial Approximation

In order to efficiently code the gray-level content of the regions, these are represented by an order-n polynomial. The basic idea behind polynomial fitting is that an attempt is made to model the gray-level variation within a region by an order-n polynomial while ensuring that the MSE between the predicted and actual values is minimized. An order-0 polynomial would ensure that each pixel in the region is represented by the average intensity value of the region. An order-1 polynomial is represented by:

z = a + bx + cy,

where z is the new intensity value at position (x, y).
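For an arbitrarily shaped region, the order-1 coefficients can be obtained by a least-squares fit over the region's pixels, as sketched below; `mask` marks the region and the fitted plane a + bx + cy is the MSE-optimal order-1 approximation of its gray levels.

```python
import numpy as np

def fit_plane(img, mask):
    """Fit z = a + b*x + c*y to the pixels of a region (least squares) and
    return the coefficients and the reconstructed region values."""
    ys, xs = np.nonzero(mask)                 # row (y) and column (x) coordinates
    z = np.asarray(img, dtype=float)[ys, xs]
    A = np.column_stack([np.ones_like(xs, dtype=float), xs, ys])
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
    reconstruction = a + b * xs + c * ys      # MSE-optimal order-1 approximation
    return (a, b, c), reconstruction
```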

2. Shape-Adaptive DCT

The shape-adaptive DCT (SADCT) proposed by Sikora (1995) and Sikora and Makai (1995) is currently very popular. The transform principles are the same as those already introduced in Section III.C: the image is organized in N x N blocks of pixels as usual. Some of these will be completely inside the region to be coded, while others will contain both pixels belonging to the region and pixels outside it. For those blocks completely contained in the region to be coded, no differences are introduced with respect to the standard DCT-based coder. For those blocks that contain only some pixels of the region to be coded, a shift of all the pixels of the original shape to the upper bound of the block is first performed. Each column is then transformed, based on the DCT transform matrix defined by Sikora and Makai (1995). Then another shift, to the left bound of the block, is performed. This is followed by a DCT transform of each line of coefficients. This final step provides the SADCT coefficients for the block. This algorithm is efficient because it is simple and it generates a total


number of coefficients corresponding to the number of pixels in the region to be coded. Its main drawback is the decorrelation of nonadjacent pixels that it introduces. Similar techniques also exist for wavelet transforms (Egger et al., 1996).
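The forward SADCT of a single block can be sketched as below, following the column-shift/column-DCT and row-shift/row-DCT procedure just described. Variable-length orthonormal DCT-II matrices are used for the varying column and row lengths; normalization details and later refinements of the method (such as the separate handling of the DC coefficient) are omitted, so this is an illustration of the principle rather than the exact transform of Sikora and Makai (1995).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2.0 * n))
    M[0, :] = np.sqrt(1.0 / n)
    return M

def sadct_block(block, mask):
    """Forward shape-adaptive DCT of one N x N block.
    `mask` is True for pixels belonging to the region. Returns the coefficient
    block plus the per-column and per-row segment lengths needed for inversion;
    the number of coefficients equals the number of region pixels."""
    N = block.shape[0]
    tmp = np.zeros((N, N), dtype=float)
    # 1) Shift the region pixels of each column to the top and DCT each column.
    col_lengths = mask.sum(axis=0)
    for c in range(N):
        vals = block[mask[:, c], c].astype(float)
        if vals.size:
            tmp[:vals.size, c] = dct_matrix(vals.size) @ vals
    # 2) Shift the column coefficients of each row to the left and DCT each row.
    coeffs = np.zeros((N, N), dtype=float)
    row_mask = np.arange(N)[:, None] < col_lengths[None, :]   # True where tmp holds data
    row_lengths = row_mask.sum(axis=1)
    for r in range(N):
        vals = tmp[r, row_mask[r]]
        if vals.size:
            coeffs[r, :vals.size] = dct_matrix(vals.size) @ vals
    return coeffs, col_lengths, row_lengths
```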

E. Contours Coding

As illustrated in Fig. 14, a segmentation-based coding approach requires a contour coding step in addition to a texture coding step. This is necessary to correctly reconstruct the shape of the regions defined during the segmentation step. Contour coding can be a complex problem. The simplest solution is to record every pixel position in the region in a bitmap-based representation. This is not the most efficient approach, but it can achieve good compression performance when combined with efficient statistical entropy coding. The trade-off between exact reconstruction of the region and efficient coding of its boundaries has been the subject of much research (Rosenfeld and Kak, 1982; Herman, 1990). Freeman chain coding (1961) is one of the earliest and most referenced techniques that attempt to code region contours efficiently, representing a given contour by an initial starting position and a set of codes representing relative positions. The Freeman chain codes are shown in Fig. 20. In this coding process, an initial starting point on the curve is stored via its (x, y) coordinates. The position of the next point on the curve is then located. This position can be in one of the eight locations illustrated in Fig. 20. If, for example, the next position is (x, y − 1), then the pixel lies in position 2 according to Freeman and hence a 2 is output. This pixel is then taken as the current position and the coding process repeats.

FIGURE 20. Each number represents the Freeman chain code for each possible movement of the central pixel.

The coding is terminated when either the original start point has been reached (closed contour) or no further points on the curve can be found (open contour). The chain code is an efficient representation of the contour because only 3 bits are required to store each code in the chain; further gains can be achieved by applying entropy coders or lossy contour coding techniques to the contours. In addition to chain coding, other approaches have been investigated. We cite the geometrical approximation methods (Gerken, 1994; Schroeder and Mech, 1995) and the methods based on mathematical morphology (Brigger, 1995). A recent technique, based on a polygonal approximation of the contour, that provides progressive and efficient compression of region contours is the one proposed by Le Buhan et al. (1998).
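A sketch of Freeman chain coding for an 8-connected contour given as an ordered list of (x, y) pixel coordinates is shown below; the direction numbering follows Fig. 20, with y increasing downwards as in image coordinates.

```python
# (dx, dy) -> Freeman code, following the numbering of Fig. 20
# (0 = right, then counter-clockwise; y grows downwards as in image coordinates).
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}
STEPS = {code: step for step, code in DIRECTIONS.items()}

def freeman_chain_code(points):
    """Encode an 8-connected contour (ordered list of (x, y) points) as a
    start point plus one 3-bit Freeman code per step."""
    codes = [DIRECTIONS[(x1 - x0, y1 - y0)]
             for (x0, y0), (x1, y1) in zip(points, points[1:])]
    return points[0], codes

def decode_chain(start, codes):
    """Invert the chain code back into the list of contour points."""
    pts = [start]
    for code in codes:
        dx, dy = STEPS[code]
        pts.append((pts[-1][0] + dx, pts[-1][1] + dy))
    return pts
```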

F. Region-Growing Techniques

Kocher and Kunt (1983) presented a technique based on region growing called contour texture modeling. The original image is preprocessed by the inverse gradient filter (Wang and Vagnucci, 1981) to remove picture noise in preparation for the region growing process. After the growing process, a large number of small regions are generated, some of which must be eliminated. This elimination is necessary in order to reduce the number of bits required to describe the segmented image, and thus increase compression ratio. It is performed on the basis of removal of small regions and merging of weakly contrasting regions. Regions whose gray-level variations differ slightly are considered weakly contrasting. In this technique, contour coding is performed in stages. First an orientation of each region contour is defined. Then spurious and redundant contour points are deleted. Also small regions are merged with nearby valid regions. Finally, the contours are approximated by line and circle segments and coded through differential coding of the successive end-of-segment addresses. Texture coding is achieved by representing the gray-level variation within the region by an nth-order polynomial function. As a final step, pseudorandom noise is added in order to produce a natural looking image. Civanlar et al. (1986) present an HVS-based segmentation coding technique in which a variation of the centroid linkage region growing algorithm (Haralick, 1983) is used to segment the image after preprocessing. In a centroid linkage algorithm the image is scanned in a set manner, for example, left to right or top to bottom. Each pixel is compared to the mean gray-level value of the already partially constructed regions in its neighborhood and if the values are close enough the pixel is included in the region


and a new mean is computed for the region. If no neighboring region has a close enough mean, the pixel is used to create a new segment whose mean is the pixel value. In the technique by Civanlar et al. (1986), the centroid linkage algorithm described here is applied with HVS-based thresholds: if the intensity difference is less than an HVS visibility threshold, the pixel is joined to the existing segment, whereas if the intensity differences between the pixel and its neighboring segments are all larger than the thresholds, a new segment is started. The work by Kocher and Kunt (1983) provides the facility to preset the approximate compression ratio prior to the operation. This is achieved by setting the maximum number of regions that will be generated by the region growing process. The results obtained via their method are good both in terms of reconstructed image quality and compression ratio. However, they point out that the performance of their technique in terms of image compression and quality is optimal for images that are naturally composed of a small number of large regions. Civanlar et al. (1986) report good image quality and compression ratios comparable to those achieved by Kocher and Kunt (1983).

G. Split-and-Merge-Based Techniques

Kwon and Chellappa (1993) and Kunt et al. (1987) present a technique based on a merge-and-threshold algorithm. After the image has been preprocessed, the intensity difference between two adjacent regions is found. If this difference is less than or equal to k, which has been initialized to 1, the regions are merged and the average of the intensities is computed. A histogram of the merged image is computed and if separable clusters exist, the above steps are repeated; otherwise, the original image is segmented by thresholding the intensity clusters. When the overall process is complete the regions obtained may be represented by an nth order polynomial. The preceding method of segmentation extracts only homogeneous regions and thus for textured regions a large number of small homogeneous regions will be generated. In terms of image coding, it is more efficient to treat textured areas as one region as opposed to several small regions. Therefore, in addition to the homogeneous region extraction scheme, textured regions are also extracted and combined with the results of the uniform region segmentation. Multiple features are used in the texture extraction process, along with the recursive thresholding method using multiple 1D histograms. First, the image is regarded as one region. A histogram is then obtained within each region of the features to be used in the extraction process. The histogram showing the best clusters is selected and this corresponding region is then


split by thresholding. These steps are repeated for all regions until none of the histograms exhibits clustering. Final segmentation is achieved by labeling the extracted uniform regions. If more than 50% of the area of such a region is covered by a textured region of type "X", then the uniform region is labeled as a textured region of that type. Adjacent uniform regions are merged with a texture region if they share at least one similar texture feature with it. In terms of coding, uniform regions are represented by polynomial reconstructions. Texture regions are represented by a texture synthesis technique using the Gaussian Markov random field (GMRF) model (Chellappa et al., 1985). Encoding the image therefore involves storing information about the contours of the regions, polynomial coefficients of the uniform regions, GMRF parameters for textured regions, and a means of identifying each region. A variable number of bits is allocated to each component. Another approach based on a split-and-merge algorithm is that by Cicconi and Kunt (1977). Segmentation is performed by initially clustering the image using a standard K-means clustering algorithm (Section IV.C.3). Once the image has been segmented into feature-homogeneous areas, an attempt to further reduce the redundancy inside the regions is implemented by looking for symmetries within the regions. In order to do this, the medial axis transformation (MAT) (Pavlidis, 1982) is used for shape description. The MAT represents each region by a curve-based descriptor. The MAT corresponds closely to the skeleton that would be produced by applying sequential erosion to the region. Values along the MAT represent the distance to the edge of the region and can be used to find its minimum and maximum widths. The histogram of the values will give the variation of the width. Once the MAT has been found, a linear prediction of each pixel on one side of the MAT can be constructed from pixels symmetrically chosen on the other side. Coding of the segmented image is performed in two stages: contour coding and texture coding. As the MAT associated with a region is reconstructible from a given contour, only contours have to be coded. Texture components in one part of the region with respect to the MAT may be represented by a polynomial function. However, representing the polynomial coefficients precisely requires a large number of bits. Therefore, the proposed method suggests defining the positions of 6 pixels, which are found in the same way for all regions, then quantizing these 6 values. These quantized values allow the unique reconstruction of the approximating second-order polynomial. Both of the preceding techniques are similar in that they employ a split-and-merge algorithm to segment the original image. However, Kwon and Chellappa (1993) state that better compression ratios may be obtained by segmenting the image into uniform and textured regions. These regions


may be coded separately; in particular, the textured regions may be more efficiently represented by a texture synthesis method, such as a GMRF model, as opposed to representing the textured region with many small uniform regions. Cicconi and Kunt's method (1977) segments the image into uniform regions and, in addition, they propose to exploit further redundancy in these regions by identifying symmetry within the regions. The gray-level variation within each of the uniform regions is represented using polynomial modeling. Cicconi and Kunt further developed a method for reducing the storage requirements for the polynomial coefficients. Despite the different methods used to represent both the contours and the gray-level variations within the regions, both methods report similar compression ratios.

H. Tree/Graph-Based Techniques

Biggar et al. (1988) developed an image coding technique based on the recursive shortest spanning tree (RSST) algorithm (Morris et al., 1986). The RSST algorithm maps the original image onto a region graph in which each region initially contains only one pixel. Sorted link weights, associated with the links between neighboring regions in the image, are used to decide which link should be eliminated and therefore which regions should be merged. After each merge, the link weights are recalculated and resorted. The removed links define a spanning tree of the original graph. Once the segmentation is complete, the spanning tree is mapped back to image matrix form, thus representing the segmented image. The regions generated are defined by coding the lines that separate the pixels belonging to different regions. The coded segmented image consists of three sources: a list of coordinates from which to start tracing the edges; the edge descriptions; and a description of the intensity profile within each region. Although the intensity profile within a region could be represented as a simple flat intensity plateau, it has been suggested by Kunt et al. (1985) and Kocher and Kunt (1983) that a better result is achievable by a higher-order polynomial representation. Biggar et al. (1988) suggest that embedding the polynomial fitting procedure at each stage of the region-merging process, as Kocher and Kunt (1983) do, would be computationally too expensive. Therefore, in this case a flat intensity plane is used to generate the regions and polynomials are fitted after the segmentation is complete. The edge information is extracted from the segmented image using the algorithm for thin line coding by Kaneko and Okudaira (1985). A similar technique to the aforementioned, based on the minimum spanning forest (MSF), is reported by Leou and Chen (1991). Segmentation and contour coding are performed exactly as described by Biggar et al. (1988); however,


the intensity values within a segmented region are coded with a polynomial representation. Here a texture extraction scheme is used, based on the assumption that light is cast from overhead on the picture and that the gray values vary according to the distance to the corresponding region centroid. After texture extraction, the regions have a high pixel-to-pixel correlation. Therefore, for simplicity and efficiency, a polynomial representation method is used to encode the texture. This is achieved by representing each row of the image by a polynomial. A different graph theory approach is presented by Kocher and Leonardi (1986), based on the region adjacency graph (RAG) data structure (Pavlidis, 1982). The RAG is again a classical map graph, with each node corresponding to a region and links joining nodes representing adjacent regions. The basic idea of the segmentation technique is that a value representing the degree of dissimilarity existing between two adjacent regions is associated with each graph link. The link that exhibits the lowest degree of dissimilarity is removed and the two nodes it connects are merged into one. This merging process is repeated until a termination criterion is reached. Once complete, the RAG representation is mapped back to the image matrix form, and thus a segmented image is created. The segmented image is coded using a polynomial representation of the regions and gives very good compression ratios. All of the preceding methods are based on similar graph structures that enable the image to be mapped to graph form in order to perform segmentation. The techniques of Biggar et al. (1988) and Kocher and Leonardi (1986) both model the texture within the image via a polynomial modeling method. However, Kocher and Leonardi report compression ratios much larger than those of Biggar et al. (1988). Leou and Chen (1991) and Pavlidis (1982) implement a segmentation technique identical to that presented by Biggar et al. However, Leou and Chen point out that better compression ratios can be achieved by first performing a texture extraction process and then modeling the extracted texture by polynomials, as opposed to applying polynomial functions to the raw intensities directly. The compression ratio achieved via this method is an improvement on that reported by Biggar et al. (1988).

A more recent technique belonging to graph-based segmentation techniques is the one proposed by Moscheni et al. (1998) and Moscheni (1997).
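The RSST merging idea described in this section can be sketched as a greedy loop: starting from one region per pixel, the link between the two adjacent regions with the most similar mean gray levels is repeatedly removed (the regions are merged) until a target number of regions remains. The data structures below are deliberately naive and slow; the actual algorithm maintains a sorted link list and is far more efficient.

```python
import numpy as np

def rsst_like_merge(img, n_regions=16):
    """Greedy region merging on a 4-connected region adjacency structure.
    Returns a label image with `n_regions` regions."""
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    labels = np.arange(rows * cols).reshape(rows, cols)      # one region per pixel
    mean = {int(l): float(img.flat[l]) for l in labels.ravel()}
    size = {int(l): 1 for l in labels.ravel()}

    def adjacency():
        pairs = set()
        for r in range(rows):
            for c in range(cols):
                for nr, nc in ((r + 1, c), (r, c + 1)):
                    if nr < rows and nc < cols and labels[r, c] != labels[nr, nc]:
                        pairs.add(tuple(sorted((int(labels[r, c]), int(labels[nr, nc])))))
        return pairs

    while len(mean) > n_regions:
        # Pick the link with the smallest difference of region means and merge b into a.
        a, b = min(adjacency(), key=lambda p: abs(mean[p[0]] - mean[p[1]]))
        mean[a] = (mean[a] * size[a] + mean[b] * size[b]) / (size[a] + size[b])
        size[a] += size[b]
        labels[labels == b] = a
        del mean[b], size[b]
    return labels
```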

I. Fractal-Based Techniques

In the previous sections, various methods for image segmentation have been suggested that lend themselves to efficient compression of the image. Most


of these techniques segment the image into regions of homogeneity and thus, when a highly textured image is encountered, the result of the segmentation is many small homogeneous regions. Jang and Rajala (1990, 1991) suggest a technique that segments the image in terms of textured regions. They also point out that in many cases previous segmentation-based coding methods are best suited to head-and-shoulder type (closeup) images and that results obtained from complex natural images are often poor. In their technique the image is segmented into texturally homogeneous regions as perceived by the HVS. Three measurable quantities are identified for this purpose: the fractal dimension; the expected value; and the just noticeable difference. These quantities are incorporated into a centroid linkage region growing algorithm that is used to segment the image into three texture classes: perceived constant intensity; smooth texture; and rough texture. An image coding technique appropriate for each class is then employed. The fractal dimension D of each block is thresholded to determine its class. The following criteria are used to categorize the particular block under consideration: D < D1, perceived constant intensity; D1 < D < D2, smooth texture; and D > D2, rough texture. After this segmentation process the boundaries of the regions are represented as a two-tone image and coded using arithmetic coding. The intensities within each region are coded separately according to their class. Those of class 1, perceived constant intensity, are represented by the average value of the region. Class 2, smooth texture, and class 3, rough texture, are encoded by polynomial modeling. It should be noted from the description in Section IV.D.1 that polynomial modeling leads to some smoothing and hence may not be useful for rough texture. Therefore, it is not clear why Jang and Rajala chose this method of representation for the class 3 regions. Each of the various segmentation techniques groups pixels according to some criterion, whether it is homogeneity, texture, or pixels within a range of gray-level values. The problem that arises after segmentation is how to efficiently code the gray-level values within each region. The most basic representation of the gray level within a region is its mean value. This will result in good compression, especially if the region is large; however, the quality of the decoded image will be poor. In most cases the gray-level variation is approximated by a polynomial function of order two. The results obtained by polynomial approximation can be visually poor, especially for highly


textured images. It is for this reason that more researchers are representing highly textured regions by texture synthesis techniques such as GMRF. These methods do not gain over the compression ratios obtained using polynomial approximation, but the quality of the reproduced image is claimed to be improved (Kwon and Chellappa, 1993). Another approach used to encode the gray-level variation is to first extract texture and then represent the residual variations by polynomials, as was done by Leou and Chen (1991). This method is of similar computational complexity, but the results in terms of compression ratio and image quality are claimed to be better than plain polynomial reconstruction. As stated by Jang and Rajala (1990, 1991), many of the aforementioned segmentation-based techniques are not sufficient when the input image is a natural one, that is, an image of a real scene. These may be images that contain highly textured regions, and when such images are segmented using conventional methods, the resulting textured region is composed of a large number of small regions. These small regions are often merged or removed in order to increase the compression ratio and, as a result, the decoded image appears very unnatural. Therefore, Jang and Rajala (1990, 1991) employed the fractal dimension to segment the image. This ensures that the image is segmented into areas that are similar in terms of surface roughness, this being the result of visualizing the image in 3D with the third dimension being that of gray-level intensity. However, once the image is segmented into regions of similar roughness, the method employed to code the identified areas is similar to that of traditional segmentation-based coding methods, that is, polynomial modeling. This polynomial modeling, as reported by Kwon and Chellappa (1993), does not suffice for the representation of highly textured regions, and they suggest the use of texture synthesis. Therefore, it may be concluded that a better segmentation-based coding method might employ the fractal dimension segmentation approach coupled with texture synthesis for textured region representation. As discussed in this section, there are a number of different methods that can be used in the segmentation process. Table 2 summarizes the methods used in the techniques that employ a segmentation algorithm as part of the coding process.

V. SUMMARY AND CONCLUSIONS

This chapter has reviewed second-generation image coding techniques. These techniques are characterized by their exploitation of the human visual system. It was noted that first-generation techniques are based on classical information theory and are largely statistical in nature. As a result they tend to deliver compression ratios of approximately 2:1. The problem with the


TABLE 2
TEXTURE CODING EMPLOYED IN SEGMENTATION ALGORITHMS

Technique                                            Texture Coding Method
HVS-based segmentation (Civanlar et al., 1986)       Polynomial function
Segmentation-based (Kwon and Chellappa, 1993)        Polynomial & GMRF
Symmetry-based coding (Cicconi and Kunt, 1977)       Polynomial function
RSST-based (Biggar et al., 1988)                     Polynomial function
MSF-based (Leou and Chen, 1991)                      Polynomial function
RAG-based (Kocher and Leonardi, 1986)                Polynomial function
Fractal dimension (Jang and Rajala, 1990)            Polynomial function

early techniques is that they ignore the semantic content of an image. In contrast, second-generation methods break an image down into visually meaningful subcomponents. Typically these are edges, contours, and textures of regions or objects in the image or video. Therefore, not surprisingly, these subdivisions are a strong theme in the emerging MPEG-4 standard for video compression. In addition, second-generation coding techniques often offer scalability where the user can trade picture quality for increased compression. An overview of the human visual system was presented to demonstrate how many of the more successful techniques closely resemble the operation of the human eye. It was explained that the HVS is particularly sensitive to sharp boundaries in a scene and why it finds gradual changes more difficult to identify--with the result that detail in such scenes can be missed. Early coding work made use of the frequency sensitivity of the eye and it can be concluded that the eye is particularly efficient at contrast resolution. The techniques considered herein were categorized into two broad approaches: transform-based coding; and segmentation-based coding. Transform-based coding initially decomposes/transforms the image into low and high frequencies (c.f. the HVS) to highlight these features that are significant to the HVS. It was observed that directional filtering is a technique that more closely matches the operation of the HVS. Following this initial stage, which is generally without loss, the transformed image is quantized and ordered according to visual importance. A range of methods from the simple uniform quantization to the more complex vector quantization can achieve this, with differing visual results in terms of image quality. Much research is still ongoing to find a suitable measure for the effect of second-generation coding techniques on the visual quality of an image. Such measures attempt to quantify the distortion introduced by the technique by providing a figure for the visual distance between the original image and the


one that has been coded and decoded. Early measures used in firstgeneration image coding, such as MSE, are not necessarily optimal in the HVS domain. The discrete cosine transform is the basis of many transform-based codings. For example, it features in standards ranging from JPEG to MPEG-4. This can be attributed to its useful balance between complexity and performance and it was noted that it approaches the optimal KL transform for highly correlated signals (such as natural images). The multiscale and pyramidal approaches to image coding are multiresolution approaches, again paralleling the HVS. In addition, they offer the possibility of progressive image transmission--an attractive feature. For these reasons, most of the current research in transform-based coding is focused on wavelet codings; indeed, this will be the basis of JPEG 2000, the new standard for still image coding. Although wavelet coding is a generalization of pyramidal coding, it does not increase the number of samples over the original and it also preserves important perceptual information such as edges. As an example of subband coding, wavelets allow each subband to be separately coded allowing a greater bit allocation to those subbands considered to be visually important; for example, the power in natural images is in the lower frequencies. Unlike DCT, there are no blocking artifacts with wavelets, although they do have their own artifact called "ringing"-- particularly around high contrast edges. A number of different wavelets exist and research is ongoing into finding criteria to assist in choosing the most suitable wavelet given the nature of the image. Segmentation-based approaches to image coding segment the original image into a set of regions following a preprocessing step to remove noise and small details. Given a set of regions, it is then "only" necessary to record the contour of each region and to code the texture within it. The segmentation approach aims to identify regions with semantic meaning (and hence a high correlation) rather than generic blocks. In addition, it is then possible to apply different codings to different regions as required. This is particularly important for video coding where research is considering how to predict the importance of regions and hence the appropriate allocation of bandwidth. The drawback to the segmentation approach is that it is not currently possible to correctly analyze the semantic content of a generic image in realtime. This section of the paper considered six approaches to segmentation: region growing; split and merge; K-means clustering; pyramidal linking; graph theory; and fractal dimension. All, except perhaps the first two, are still being actively researched. Methods for texture representation range from using the mean value, through polynomial approximations to texture synthesis techniques. Both mean value and polynomial approximations


(which tend to be second-order) yield poor quality images, especially if the image is highly textured. It must be remembered that the semantic representation does not always correspond to the definition of a homogeneous region. At present, the shape-adaptive DCT, particularly that proposed by Sikora (1995), is very popular. Most current research is concentrating on texture synthesis techniques, for example, GMRF. In coding the contours of segmented regions there is a trade-off between exactness and efficiency. Freeman's (1961) chain code is probably the most referenced technique in the literature, but has been surpassed by techniques based on geometrical approximations, mathematical morphology, and the polygonal approximation of Le Buhan et al. (1998). We conclude that the best approach to segmentation-based coding is currently a technique that uses fractal dimensions for the segmentation phase and texture synthesis techniques for the texture representation. The future of image and video coding will probably be driven by multimedia interaction. Coding schemes for such applications must support object functionalities such as dynamic coding and object scalability. Initial research is directed at how to define the objects. Such object-based coding is already being actively pursued in the field of medical imaging. The concept is to define 3D models of organs such as the heart and then, instead of sending an image, the parameters of the model that best match the current data are sent and completed with an error function. This work is still in its infancy. All of the techniques reviewed here have their relative merits and drawbacks. In practice, the choice of technique will often be influenced by non-technical matters such as the availability of an algorithm, or its inclusion in an imaging software library. A direct comparison between techniques is difficult as each is based on different aspects of the HVS. This is often the reason why a technique performs well on one type or source of images but not on others. Comparing the compression ratios of lossy techniques is meaningless if image quality is ignored. Therefore, until a quantitative measure of image quality is established, direct comparisons are really not possible. In the meantime, shared experiences and experimentation for a particular application will have to provide the best method of determining the appropriateness of a given technique.

ACKNOWLEDGMENTS

The authors would like to thank Julien Reichel, Marcus Nadenau, and Pascal Fleury for their contributions and useful suggestions.

REFERENCES

Akansu, N., Haddad, R. A., and Caglar, H. (1993). The binomial QMF-wavelet transform for multiresolution signal decomposition, IEEE Trans. Signal Proc., 41: 13-19.
Berger, T. (1972). Optimum quantizers and permutation codes, IEEE Trans. Information Theory, 18: (6), 756-759.
Biggar, M., Morris, O., and Constantinides, A. (1988). Segmented-image coding: Performance comparison with the discrete cosine transform, IEE Proc., 135: (2), 121-132.
Blackwell, H. (1946). Contrast thresholds of the human eye, Jour. Opt. Soc. Am., 36: 624-643.
Brigger, P. (1995). Morphological Shape Representation Using the Skeleton Decomposition: Application to Image Coding, PhD Thesis No. 1448, EPFL, Lausanne, Switzerland.
Burt, P. J. and Adelson, E. H. (1983). The Laplacian pyramid as a compact image code, IEEE Trans. Comm., COM-31: (4), 532-540.
Burt, P., Hong, T. H., and Rosenfeld, A. (1981). Segmentation and estimation of region properties through cooperative hierarchical computation, IEEE Trans. Syst., Man, Cyber., SMC-11: 802-809.
Caglar, H., Liu, Y., and Akansu, A. N. (1993). Optimal PR-QMF design for subband image coding, Jour. Vis. Comm. and Image Represent., 4: 242-253.
Castagno, R. (1998). Video Segmentation Based on Multiple Features for Interactive and Automatic Multimedia Applications, PhD Thesis, Swiss Federal Institute of Technology, Lausanne.
Chellappa, R., Chatterjee, S., and Bagdazian, R. (1985). Texture synthesis and compression using Gaussian-Markov random fields, IEEE Trans. Syst. Man Cybern., SMC-15: 298-303.
Chen, W. H., Harrison Smith, C., and Fralick, S. C. (1977). A fast computational algorithm for the Discrete Cosine Transform, IEEE Trans. Comm., 1004-1009.
Cicconi, P. and Kunt, M. (1977). Symmetry-based image segmentation, Proc. Soc. Photo-Optical Instrumentation Eng. (SPIE), 378-384.
Cicconi, P. et al. (1994). New trends in image data compression, Comput. Med. Imaging Graph., 18: (2), 107-124.
Civanlar, M., Rajala, S., and Lee, X. (1986). Second generation hybrid image-coding techniques, SPIE-Visual Comm. Image Process., 707: 132-137.
Coifman, R. R. and Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection, IEEE Trans. Inform. Theory, 38: 713-718.
Cornsweet, T. N. (1970). Visual Perception, New York: Academic Press.
Crochiere, R. E., Weber, S. A., and Flanagan, F. L. (1976). Digital coding of speech in sub-bands, Bell Syst. Tech. J., 1069-1085.
Croft, L. H. and Robinson, J. A. (1994). Subband image coding using watershed and watercourse lines of the wavelet transform, IEEE Trans. Image Proc., 3: 759-772.
Croisier, X., Esteban, D., and Galand, C. (1976). Perfect channel splitting by use of interpolation, decimation, tree decomposition techniques, Proc. Int'l Conf. Inform. Sci./Systems, 443-446.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets, Comm. Pure Appl. Math., 41: 909-996.
Daubechies, I. (1993). Orthonormal bases of compactly supported wavelets II, Variations on a theme, SIAM J. Math. Anal., 24: 499-519.
Daubechies, I. and Sweldens, W. (1998). Factoring wavelet transforms into lifting steps, J. Fourier Anal. Appl., 4: (4), 245-267.
Dinstein, K., Rose, A., and Herman, A. (1990). Variable block-size transform image coder, IEEE Trans. Comm., 2073-2078.


Ebrahimi, T. (1997). MPEG-4 video verification model: A video encoding/decoding algorithm based on content representation, Signal Processing: Image Comm., 9: (4), 367-384.
Ebrahimi, T. et al. (1995). Dynamic coding of visual information, technical description ISO/IEC JTC1/SC2/WG11/M0320, MPEG-4, Swiss Federal Institute of Technology.
Egger, O., Fleury, P., and Ebrahimi, T. (1996). Shape adaptive wavelet transform for zerotree coding, European Workshop on Image Analysis and Coding, Rennes.
Fleury, P. and Egger, O. (1997). Neural network based image coding quality prediction, ICASSP, Munich.
Fleury, P., Reichel, J., and Ebrahimi, T. (1996). Image quality prediction for bitrate allocation, in IEEE Proc. ICIP, 3: 339-342.
Freeman, H. (1961). On the encoding of arbitrary geometric configurations, IRE Trans. Electronic Computers, 10: 260-268.
Froment, J. and Mallat, S. (1992). Second generation compact image coding with wavelets, in Wavelets: A Tutorial in Theory and Applications, C. K. Chui, ed., San Diego: Academic Press.
Gerken, P. (1994). Object-based analysis-synthesis coding of image sequences at very low bit rates, IEEE Trans. Circuits, Systems Video Technol., 4: (3), 228-235.
Golomb, S. (1966). Run length encodings, IEEE Trans. Inf. Theory, IT-12: 399-401.
Haber, R. N. and Hershenson, M. (1973). The Psychology of Visual Perception, New York: Holt, Rinehart and Winston.
Haralick, R. (1983). Image segmentation survey, in Fundamentals in Computer Vision, Cambridge: Cambridge University Press.
Harris, F. (1978). On the use of windows for harmonic analysis with the discrete Fourier transform, Proc. IEEE, 66: (1), 51-83.
Herman, T. (1990). On topology as applied to image analysis, Computer Vision Graphics Image Proc., 52: 409-415.
Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. IRE, 40: (9), 1098-1101.
Ikonomopoulos, A. and Kunt, M. (1985). High compression image coding via directional filtering, Signal Processing, 8: 179-203.
Jain, K. (1976). A fast Karhunen-Loeve transform for a class of random processes, IEEE Trans. Comm., COM-24: 1023-1029.
Jain, K. (1989). Image transforms, in Fundamentals of Digital Image Processing, Chapter 5, Englewood Cliffs, NJ: Prentice-Hall Information and System Science Series.
Jang, J. and Rajala, S. (1990). Segmentation based image coding using fractals and the human visual system, in Proc. ICASSP'90, 1957-1960.
Jang, J. and Rajala, S. (1991). Texture segmentation-based image coder incorporating properties of the human visual system, in Proc. ICASSP'91, 2753-2756.
Jayant, N., Johnston, J., and Safranek, R. (1993). Signal compression based on models of human perception, Proc. IEEE, 81: (10), 1385-1422.
Johnston, J. (1980). A filter family designed for use in quadrature mirror filter banks, Proc. Int'l Conference on Acoustics, Speech and Signal Processing, ICASSP, 291-294.
Jordan, L., Ebrahimi, T., and Kunt, M. (1998). Progressive content-based compression for retrieval of binary images, Computer Vision and Image Understanding, 71: (2), 198-212.
Kaneko, T. and Okudaira, M. (1985). Encoding of arbitrary curves based on the chain code representation, IEEE Trans. Comm., COM-33: (7), 697-707.
Karhunen, H. (1947). Über lineare Methoden in der Wahrscheinlichkeits-Rechnung, Ann. Acad. Science Fenn., A.I: (37).
Kocher, M. and Kunt, M. (1983). Image data compression by contour texture modelling, Proc. Soc. Photo-Optical Instrumentation Eng. (SPIE), 397: 132-139.


Kocher, M. and Leonardi, R. (1986). Adaptive region growing technique using polynomial functions for image approximation, Signal Processing, 11: 47-60.
Kovalevsky, X. (1993). Topological foundations of shape analysis, in Shape in Picture: Mathematical Description of Shape in Grey-Level Images, NATO ASI Series F: Comput. Systems Sci., 126: 21-36.
Kunt, M. (1998). A vision of the future of multimedia technology, in Mobile Multimedia Communication, Chapter 41, pp. 658-669, New York: Academic Press.
Kunt, M., Benard, M., and Leonardi, R. (1987). Recent results in high compression image coding, IEEE Trans. Circuits and Systems, CAS-34: 1306-1336.
Kunt, M., Ikonomopoulos, A., and Kocher, M. (1985). Second generation image coding, Proc. IEEE, 73: (4), 549-574.
Kwon, O. and Chellappa, R. (1993). Segmentation-based image compression, Optical Engineering, 32: (7), 1581-1587.
Lai, Yung-Kai and Kuo, C.-C. Jay (1998a). Wavelet-based perceptual image compression, IEEE International Symposium on Circuits and Systems, Monterey, CA, May 31-June 3, 1998.
Lai, Yung-Kai and Kuo, C.-C. Jay (1998b). Wavelet image compression with optimized perceptual quality, Conference on "Applications of Digital Image Processing XXI," SPIE's Annual Meeting, San Diego, CA, July 19-24, 1998.
Leou, F. and Chen, Y. (1991). A contour based image coding technique with its texture information reconstructed by polyline representation, Signal Processing, 25: 81-89.
Lewis, S. and Knowles, G. (1992). Image compression using the 2-D wavelet transform, IEEE Trans. Image Processing, 1: 244-250.
Lin, Fu-Huei and Mersereau, R. M. (1996). Quality measure based approaches to MPEG encoding, in Proc. ICIP, 3: 323-326, Lausanne, Switzerland, September 1996.
Loève, M. (1948). Fonctions aléatoires de second ordre, in Processus stochastiques et mouvement brownien, P. Lévy, ed., Paris: Hermann.
Lu, J., Algazi, V. R., and Estes, R. R. (1996). A comparative study of wavelet image coders, Optical Engineering, 35: (9), 2605-2619.
Macq, B. (1989). Perceptual Transforms and Universal Entropy Coding for an Integrated Approach to Picture Coding, PhD Thesis, Université Catholique de Louvain, Louvain-la-Neuve, Belgium.
Mallat, S. G. (1989a). Multifrequency channel decomposition of images and wavelet models, IEEE Trans. Acoustics, Speech and Signal Processing, 37: 2091-2110.
Mallat, S. G. (1989b). A theory of multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. and Machine Intell., 11: 674-693.
Mallat, S. G. (1991). Zero-crossings of a wavelet transform, IEEE Trans. Inform. Theory, 37: 1019-1033.
Mallat, S. G. and Zhong, S. (1991). Compact coding from edges with wavelets, in Proc. ICASSP'91, 2745-2748.
Mandelbrot, B. (1982). The Fractal Geometry of Nature, 1st edition, New York: Freeman.
Mannos, J. L. and Sakrison, D. J. (1974). The effects of a visual fidelity criterion on the encoding of images, IEEE Trans. Information Theory, 20: (4), 525-536.
Marques, F., Gasull, A., Reed, T., and Kunt, M. (1991). Coding-oriented segmentation based on Gibbs-Markov random fields and human visual system knowledge, in Proc. ICASSP'91, 2749-2752.
Miyahara, M., Kotani, K., and Algazi, V. R. (1992). Objective picture quality scale (PQS) for image coding, Proc. SID Symposium for Image Display, 44: (3), 859-862.
Morris, O., Lee, M., and Constantinides, A. (1986). Graph theory for image analysis: An approach based on the shortest spanning tree, IEEE Proc. F, 133: (2), 146-152.
Moscheni, F. (1997). Spatio-Temporal Segmentation and Object Tracking: An Application to Second Generation Video Coding, PhD Thesis, Swiss Federal Institute of Technology, Lausanne.
Moscheni, F., Bhattacharjee, S., and Kunt, M. (1998). Spatiotemporal segmentation based on region merging, IEEE Trans. Pattern Anal. Mach. Intell., 20: (9), 897-915.
Nadenau, M. and Reichel, J. (1999). Compression of color images with wavelets under consideration of the HVS, Proc. SPIE Human Vision and Electronic Imaging IV, San Jose, CA.
Narasimha, M. J. and Peterson, A. M. (1978). On the computation of the Discrete Cosine Transform, IEEE Trans. Comm., COM-26: 934-936.
Osberger, W., Maeder, A. J., and Bergmann, N. (1996). A perceptually based quantization technique for MPEG encoding, Proc. SPIE Human Vision and Electronic Imaging, 3299: 148-159, San Jose, CA.
Pal, N. and Pal, S. (1993). A review on image segmentation techniques, Pattern Recognition, 26: (9), 1277-1294.
Pavlidis, T. (1982). Algorithms for Graphics and Image Processing, 1st edition, Rockville, MD: Computer Science Press.
Pearson, D. E. (1975). Transmission and Display of Pictorial Information, London: Pentech Press.
Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Machine Intell., 12: (7), 629-639.
Poirson, A. B. and Wandell, B. A. (1993). Appearance of colored patterns: pattern-color separability, Optics and Image Science, 10: (12), 2458-2470.
Poirson, A. B. and Wandell, B. A. (1996). Pattern-color separable pathways predict sensitivity to simple colored patterns, Vision Research, 36: (4), 515-526.
Ramchandran, K. and Vetterli, M. (1993). Best wavelet packet bases in a rate-distortion sense, IEEE Trans. Image Processing, 2: 160-175.
Rose, A. (1973). Vision: Human and Electronic, New York: Plenum.
Rosenfeld, A. and Kak, A. C. (1982). Digital Picture Processing, San Diego: Academic Press.
Said, A. and Pearlman, W. A. (1996). A new, fast, and efficient image codec based on set partitioning in hierarchical trees, IEEE Trans. Circuits and Systems for Video Technology, 6: (3), 243-250.
Salembier, P. and Kunt, M. (1992). Size-sensitive multiresolution decomposition of images with rank order based filters, Signal Processing, 27: 205-241.
Samet, H. (1989a). Applications of Spatial Data Structures, 1st edition, Reading, MA: Addison-Wesley.
Samet, H. (1989b). The Design and Analysis of Spatial Data Structures, 1st edition, Reading, MA: Addison-Wesley.
Schalkoff, R. J. (1989). Digital Image Processing and Computer Vision, Singapore: John Wiley and Sons.
Schreiber, W. F. (1963). The mathematical foundation of the synthetic highs system, MIT RLE Quart. Progr. Rep., No. 68, p. 140.
Schreiber, W. F., Knapp, C. F., and Kay, N. D. (1959). Synthetic highs, an experimental TV bandwidth reduction system, Jour. SMPTE, 68: 525-537.
Schroeder, M. R. and Mech, R. (1995). Combined description of shape and motion in an object based coding scheme using curved triangles, IEEE Int. Conf. Image Proc., Washington, 2: 390-393.
Senoo, T. and Girod, B. (1992). Vector quantization for entropy coding of image subbands, IEEE Trans. Image Proc., 1: 526-533.
Shapiro, J. M. (1993). Embedded image coding using zerotrees of wavelet coefficients, IEEE Trans. Signal Proc., 41: (12), 3445-3462.
Sikora, T. (1995). Low complexity shape-adaptive DCT for coding of arbitrarily shaped image segments, Signal Processing: Image Communication, 7: 381-395.


Sikora, T. and Makai, B. (1995). Shape-adaptive DCT for generic coding of video, IEEE Trans. Circuits and Systems for Video Technol., 5: 59-62.
Simoncelli, E. P. and Adelson, E. H. Efficient Pyramid Image Coder (EPIC), public domain software available from ftp://ftp.cis.upenn.edu/pub/eero/epic.tar.Z (Jan. 2000).
Smith, M. J. T. and Barnwell, T. P. (1986). Exact reconstruction techniques for tree structured subband coders, IEEE Trans. Acoustics, Speech, and Signal Processing, 34: 434-441.
Sziranyi, T., Kopilovic, I., and Toth, B. P. (1998). Anisotropic diffusion as a preprocessing step for efficient image compression, 14th ICPR, IAPR, Brisbane, Australia, pp. 1565-1567, August 16-20, 1998.
Taubman, D. and Zakhor, A. (1994). Multirate 3-D subband coding of video, IEEE Trans. Image Proc., 572-588.
Toet, A. (1989). A morphological pyramid image decomposition, Pattern Recognition Letters, 9: 255-261.
Vaidyanathan, P. P. (1987). Theory and design of M channel maximally decimated QMF with arbitrary M, having perfect reconstruction property, IEEE Trans. Acoustics, Speech, and Signal Processing.
Van den Branden Lambrecht, C. J. (1996). Perceptual Models and Architectures for Video Coding Applications, PhD Thesis, Swiss Federal Institute of Technology, Lausanne, Switzerland.
Vetterli, M. (1984). Multi-dimensional subband coding: some theory and algorithms, IEEE Trans. Acoustics, Speech, and Signal Processing, 97-112.
Wandell, B. A. (1995). Foundations of Vision, Sunderland, MA: Sinauer Associates.
Wang, T. P. and Vagnucci, A. (1981). Gradient inverse weighted smoothing scheme and the evaluation of its performance, Computer Graphics and Image Processing, 15: 167-181.
Welch, T. (1977). A technique for high performance data compression, IEEE Computing, 17: (6), 8-19.
Westen, S. J. P., Lagendijk, R. L., and Biemond, J. (1996a). Optimization of JPEG color image coding under a human visual system model, Proc. SPIE Human Vision and Electronic Imaging, 2657: 370-381, San Jose, CA.
Westen, S. J. P., Lagendijk, R. L., and Biemond, J. (1996b). Spatio-temporal model of human vision for digital video compression, Proc. SPIE Human Vision and Electronic Imaging, 3016: 260-268.
Winkler, S. (1998). A perceptual distortion metric for digital color images, in Proc. ICIP 1998, Chicago, IL, 3: 399-403.
Woods, J. and O'Neil, S. (1986). Subband coding of images, IEEE Trans. Acoustics, Speech, and Signal Processing, 1278-1288.
You, Y., Xu, W., Tannenbaum, A., and Kaveh, M. (1996). Behavioral analysis of anisotropic diffusion in image processing, IEEE Trans. Image Processing, 5: (11), 1539-1553.
Zhou, Z. and Venetsanopoulos, A. N. (1992). Morphological methods in image coding, Proc. Int'l Conf. Acoust., Speech, and Signal Processing ICASSP, 3: 481-484.
Ziemer, R., Tranter, W., and Fannin, D. (1989). Signals and Systems: Continuous and Discrete, 2nd edition, New York: Macmillan.
Ziliani, F. (1998). Focus of attention: an image segmentation procedure based on statistical change detection, Internal Report 98.02, LTS, Swiss Federal Institute of Technology, Lausanne, Switzerland.
Ziliani, F. and Jensen, B. (1998). Unsupervised image segmentation using the modified pyramidal linking approach, Proc. IEEE Int. Conf. Image Proc., ICIP'98.