Pattern Recognition Letters 23 (2002) 1119–1127 www.elsevier.com/locate/patrec
A comparative study of focused Jpeg compression, FJpeg

Claudio M. Privitera *, Michela Azzariti, Yeuk F. Ho, Lawrence W. Stark

Neurology and Telerobotics Units, School of Optometry, 485 Minor Hall, University of California, Berkeley, CA 94720-2020, USA

Received 8 May 2001; received in revised form 6 September 2001
Abstract

Eye movements are one important component of human vision: only specific regions of the visual scene are fixated and processed by the brain at high resolution. The rest of the image is sampled at lower and coarser resolution by the retina; still, the image is perceived as uniformly clear. A focused Jpeg encoder, FJpeg, or visual-compression, has been developed to operate in a similar fashion. Its implementation is based on specific image processing algorithms capable of predicting human regions-of-interest, ROIs, and on a corresponding differentiated quantization of the DCT coefficients. With FJpeg, the fidelity of ROIs is maintained by preserving their Jpeg–DCT frequency coefficients and balancing this by strongly compressing the residual part of an image, peripheral to the ROIs. It is possible in this way to increase the average compression ratio while preserving the visual quality of the image. We present carefully designed experiments to demonstrate that subjects judged FJpeg visual-compression images as superior to standard Jpeg images over a number of conditions. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Eye movements; Regions-of-interest; Jpeg; Image compression; Visual comparisons
1. Introduction

The scanpath is a repetitive, idiosyncratic and alternating sequence of rapidly jumping saccadic eye movements, EMs, and fixations. Each fixation lasts about one-third of a second to enable visual processing. In their role as an important part of human vision, EMs only allow the eye and visual brain to sample or foveate several specific regions-of-interest, ROIs, of the visual field at high resolution. The rest of the image is viewed by the
* Corresponding author. Tel.: +1-510-642-5309; fax: +1-510-642-7196. E-mail address: [email protected] (C.M. Privitera).
peripheral retina at coarser resolution. We do not consciously appreciate this lack of resolution in the periphery and indeed, with only several EM fixations, we can perceive and confirm the understanding of a complex visual stimulus (Fig. 1). This is an indication that human vision uses internal spatial-cognitive models to infer the external world by using a top-down cognitive process that includes information from pre-knowledge, from the peripheral image glimpse, and from those half-dozen or so ROIs fixated (Noton and Stark, 1971a,b). In parallel with our studies on human EMs, we have investigated image processing algorithms, IPAs, that predict where human eyes fixate. Several IPAs were investigated, ranging from statistical to structural and modeling approaches. For
0167-8655/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-8655(02)00030-2
C.M. Privitera et al. / Pattern Recognition Letters 23 (2002) 1119–1127
Fig. 1. Human vision. The human brain and eye collect both a low-resolution overview using the entire retina (upper right) and also small high-resolution views in the fovea (central region of the retina). With each fixation the eye jumps (saccades) from ROI to ROI (upper left) under top-down control of active looking driven by an internal cognitive model in the brain (the mind’s eye). Composite view, as achieved by the eye (lower left); original picture (lower right).
each specific IPA, ROIs are generated by applying the algorithm to the digital image and then clustering the resulting local maxima into several clusters. For each image, the centers of these clusters can be finally compared with human eye movement loci data. Several test images and human data were utilized to select a corpus of IPAs that best match eye fixations. This is fully documented in a series of studies: see for example Privitera and Stark (2000) and Privitera et al. (2000). We found that the final selection of IPAs was able to predict human fixations at about the same level that one person can predict another person’s fixations; this level is intuitively the best result we could have expected. Jpeg is the most common image format and is the scheme most widely used for compression over
bandwidth-limited channels like the web or wireless communication. We therefore implemented a focused Jpeg, FJpeg, encoder or visual-compression procedure that is based on the predicted ROIs and that sits on top of Jpeg. FJpeg performs a differentiated quantization of the standard Jpeg–DCT coefficients. ROIs are kept at higher bits/pixel than the rest of the image. The fidelity of the ROIs, which in total encompass only a small percentage of the picture, is maintained by preserving their Jpeg–DCT frequency coefficients and balancing this by strongly compressing the residual part of an image, peripheral to the ROIs. Thus, FJpeg uses the capabilities of Jpeg while adding an extra layer of compression onto the peripheral non-ROI areas of the picture. FJpeg visual-compression is fully Jpeg compatible in the sense that, even if a special
encoding protocol is necessary, focused visual-compression images can be read by standard Jpeg decoders (Privitera and Stark, 1999). In this way FJpeg preserves the important visual information within the relevant ROIs of the image in an attempt to simulate human scanpaths and visual characteristics. This enables a higher visual-quality to be obtained for the same compression ratio. Although visual-quality is a subjective factor, somewhat idiosyncratic to each observer, it can be experimentally measured. Thus, the aim of this paper is to test and substantiate this focused visual-compression approach. We therefore generated a set of different standard and focused Jpeg images. The visual-quality of standard Jpeg and FJpeg color images was compared over a range of compression ratios using visual psychophysics experiments.
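The ROI-prediction step outlined in this Introduction (apply an IPA, cluster the resulting local maxima, and take the cluster centers as ROIs) can be illustrated with a minimal sketch. The greedy distance-based clustering below is an assumption chosen for brevity and is not the clustering procedure of the cited studies; the function name `cluster_maxima` and its parameters `radius` and `n_rois` are hypothetical.

```python
import math

def cluster_maxima(maxima, radius, n_rois):
    """Greedily group (x, y, strength) local maxima and return up to n_rois centres.

    Hypothetical illustration only: each maximum joins the first cluster whose
    strength-weighted centre lies within `radius`, otherwise it starts a new one.
    """
    clusters = []  # each entry: [sum(w * x), sum(w * y), sum(w)]
    for x, y, w in sorted(maxima, key=lambda m: -m[2]):
        for c in clusters:
            cx, cy = c[0] / c[2], c[1] / c[2]
            if math.hypot(x - cx, y - cy) <= radius:
                c[0] += w * x
                c[1] += w * y
                c[2] += w
                break
        else:
            clusters.append([w * x, w * y, w])
    # Keep the strongest clusters; their weighted centres are the predicted ROIs.
    clusters.sort(key=lambda c: -c[2])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters[:n_rois]]

# Made-up IPA output: two nearby strong maxima, one medium, one weak.
maxima = [(10, 10, 5), (12, 10, 5), (100, 100, 3), (200, 50, 1)]
print(cluster_maxima(maxima, radius=20, n_rois=2))
```

The resulting centers could then be compared with recorded fixation loci, which is the comparison the text describes.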
2. Experimental methods

2.1. Focused Jpeg compression

In standard Jpeg compression the image is decomposed into 8 × 8 pixel blocks, which are transformed by a discrete cosine transform, DCT, into blocks of frequency coefficients, $f_x$. Each frequency coefficient is modified, or quantized, by a quantizer term $R_x = Q_x S$ through the following division and rounding operations:

$$f_x^{*} = \lfloor f_x \pm R_x/2 \rfloor / R_x, \qquad (1)$$

where $f_x^{*}$ represents the quantized frequency coefficient and the sign of $f_x^{*}$ is the same as that of the $f_x$ coefficient. The quantizer term $R_x$ controls the lossy compression level of the image and is composed of two different factors, $Q_x$ and $S$. The frequency-dependent factors $Q_x$ are arranged so that high-frequency coefficients of the DCT-transformed blocks are preferentially zeroed, since these define aspects of the image that are not perceived in human vision. Thus this lossy step actually minimizes the intrusion into human visual appreciation of the image. The $S$ term is an overall quality factor that also controls the zeroing and lossy compression of all terms. However, $S$ is independent of frequency and for standard Jpeg it is identically defined for all blocks of the image. The lossy quantization is finally followed by run-length and Huffman coding, both lossless (Pennebaker and Mitchell, 1993).

In the FJpeg visual-compression procedure we exploit the factor $S$ in a special way by assigning a particular value to each DCT block of the image. We use it as a spatial quality factor dependent upon the distance of a block, $d_b$, from the centers of the ROIs; thus $R_{x,d_b} = Q_x S(d_b)$. For example, $S(d_b)$ could be a graded monotonic function equal to unity if the block coincides with the locus of one of the ROIs and then increasing with the distance from that ROI. This grading of the $S(d_b)$ factor would thus allow preservation of information in blocks within and near the ROIs while strongly compressing blocks in the periphery, far from ROIs. It is understood that the term periphery is used to mean the portion of the image outside the ROIs and their closely surrounding areas.

In this study, $S(d_b)$ was defined as a step function characterized by a value $S(d_b)_{\mathrm{in}}$ for the area within an ROI and a value $S(d_b)_{\mathrm{out}}$ for the rest of the image. These two values were varied, yielding different compression ratios; however, $S(d_b)_{\mathrm{in}}$ was consistently much smaller than $S(d_b)_{\mathrm{out}}$, approximately 5% of $S(d_b)_{\mathrm{out}}$. The total area occupied by the ROIs was equivalent to a quarter of the entire image; this total area was equally divided among the ROIs, even though their number varies in the different experiments below.

The compression ratio is defined as one minus the ratio between the memory occupancy of the final compressed image and that of the original image. For example, a compression ratio of 0.85 would result if the original image occupied 25 KB and the compressed image 3.75 KB (Fig. 2).

A further aspect of the FJpeg procedure uses color-consistency to obscure the effects of blockiness, a well-known defect in highly compressed standard Jpeg images.
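As an illustration of the spatially varying quantizer and of the compression-ratio definition given in this section, the following is a minimal sketch under stated assumptions: ROIs are taken to be axis-aligned squares around given centers, Eq. (1) is read as a round-to-nearest division that keeps the sign of the coefficient, and all names (`step_quality_map`, `quantize`, `compression_ratio`) and the numeric values are hypothetical rather than taken from the FJpeg implementation.

```python
import math

BLOCK = 8  # Jpeg block size in pixels

def step_quality_map(width, height, roi_centers, roi_half_size, s_in, s_out):
    """Per-block S value: s_in for blocks centred inside an ROI square, s_out elsewhere."""
    s_map = []
    for by in range(height // BLOCK):
        row = []
        for bx in range(width // BLOCK):
            cx = bx * BLOCK + BLOCK / 2  # block centre, pixel coordinates
            cy = by * BLOCK + BLOCK / 2
            inside = any(abs(cx - rx) <= roi_half_size and abs(cy - ry) <= roi_half_size
                         for rx, ry in roi_centers)
            row.append(s_in if inside else s_out)
        s_map.append(row)
    return s_map

def quantize(coeff, q, s):
    """Divide by R = Q * S with rounding to nearest, keeping the sign of coeff."""
    r = q * s
    magnitude = math.floor((abs(coeff) + r / 2) / r)
    return int(math.copysign(magnitude, coeff)) if magnitude else 0

def compression_ratio(original_bytes, compressed_bytes):
    """One minus compressed size over original size."""
    return 1 - compressed_bytes / original_bytes

# A 64 x 64 image with one ROI covering a quarter of its area; s_in is 5% of s_out.
s_map = step_quality_map(64, 64, [(32, 32)], 16, s_in=1.0, s_out=20.0)
print(sum(row.count(1.0) for row in s_map), "of", len(s_map) * len(s_map[0]), "blocks in ROI")
print(quantize(37, 10, 1.0), quantize(37, 10, 20.0))  # same coefficient, ROI vs periphery
print(round(compression_ratio(25.0, 3.75), 2))         # the 25 KB / 3.75 KB example
```

With these made-up values the same coefficient survives inside the ROI and is zeroed in the periphery, which is the behavior the step function is meant to produce.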
In FJpeg, the DC frequency coefficients, $f_{DC}$, were left unaltered by the quantization process, no matter what the spatially appropriate value of $S(d_b)$ was. In this way, for very strong levels of compression (i.e. a large $S(d_b)_{\mathrm{out}}$-value), although all the other frequency coefficients were zeroed by the strong quantization, color information (or gray level) was
Fig. 2. Subjective comparisons of visual-quality. Note the higher visual-quality of the FJpeg images (right column) compared to standard Jpeg (middle column) for the same strong compression ratio (circa 0.85 for all three examples). Especially, the airplane image (middle row) exhibits blockiness with standard Jpeg; the color-consistency method apparently obscures this in the periphery of the FJpeg image. Similarly, the color value of the sky and water of the port scene (upper row) are also preserved with FJpeg.
always saved (note this effect acting on the extended vistas peripheral to the ROIs in the images of Fig. 2); this also contributed a consistency that helped maintain the overall quality of visual-compression. In the classical Jpeg framework, as in FJpeg, color is represented in the YUV coordinate space (the luminance Y-channel is always left at full resolution whereas the two chrominance UV-channels are down-sampled). The DCT is then applied to all the YUV channels and, with FJpeg, color consistency is achieved by saving, for each of these YUV channels, the corresponding DC component. We conjecture that human low-resolution peripheral vision also maintains a degree of color consistency (Fig. 1, upper right and lower left).
Summarizing, the frequency coefficients, in the FJpeg visual-compression schema, are preserved in the several ROIs that in total encompass only a small percentage area of the picture. This is balanced by an extreme penalization, or simplification, of areas peripheral to the ROIs. In this way, FJpeg and Jpeg can be compared for the same level of average compression ratio. However, it is worth emphasizing that, because of the color consistency employed in FJpeg, a stronger level of compression in the periphery does not necessarily correspond to as much degradation of the visual quality as inherent in standard Jpeg (which does not employ any color consistency schema).
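The effect of preserving the DC coefficient under extreme peripheral quantization can be checked numerically. The sketch below is illustrative only: it uses a plain orthonormal 8 × 8 DCT written out in pure Python, a single scalar quantizer step `r` in place of the full $Q_x S(d_b)$ matrix, and a made-up gray-level ramp as the test block; the function names are hypothetical.

```python
import math

N = 8  # block size

def _alpha(k):
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct2(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of lists."""
    return [[_alpha(u) * _alpha(v) * sum(
                block[y][x]
                * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                for y in range(N) for x in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coeffs):
    """Inverse of dct2."""
    return [[sum(_alpha(u) * _alpha(v) * coeffs[u][v]
                 * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for x in range(N)] for y in range(N)]

def focused_quantize_block(coeffs, r, keep_dc=True):
    """Quantize and dequantize every coefficient with step r; optionally keep DC as is."""
    out = [[math.copysign(math.floor((abs(c) + r / 2) / r), c) * r for c in row]
           for row in coeffs]
    if keep_dc:
        out[0][0] = coeffs[0][0]  # colour-consistency: DC bypasses the quantizer
    return out

# Made-up peripheral block: a gentle gray-level ramp with mean 110.5.
block = [[100 + 2 * x + y for x in range(N)] for y in range(N)]
coeffs = dct2(block)
with_dc = idct2(focused_quantize_block(coeffs, r=2000.0, keep_dc=True))
without_dc = idct2(focused_quantize_block(coeffs, r=2000.0, keep_dc=False))
print(round(with_dc[0][0], 3), round(abs(without_dc[0][0]), 3))
```

In this example every coefficient, DC included, falls below half the quantizer step, so without the DC exception the block would decode to zero, whereas with it the block decodes to a flat patch at the original mean level.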
2.2. Standard Jpeg decoding after focused compression

In the standard Jpeg procedure, the matrix of quantizer factors $R_x$ is part of the header of the compressed file. The decoder needs this matrix to dequantize the quantized DCT coefficients $f_x^{*}$: $\hat{f}_x = f_x^{*} R_x$. In order to make the FJpeg file readable by any standard Jpeg decoder, we must implement a focused pre-dequantization immediately after the focused quantization:

$$\hat{f}_{x,d_b} = f_{x,d_b}^{*} R_{x,d_b}. \qquad (2)$$
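The chain of focused quantization, pre-dequantization by Eq. (2), and a final standard quantization can be sketched for a single coefficient. This is a hedged illustration, not the FJpeg encoder: the names `round_div`, `fjpeg_encode_coeff`, `standard_decode_coeff`, `s_block` and `s_header`, and all numeric values, are assumptions introduced here, and the rounding rule is taken to be round-to-nearest with the sign preserved.

```python
import math

def round_div(value, step):
    """Round value / step to the nearest integer, keeping the sign of value."""
    magnitude = math.floor((abs(value) + step / 2) / step)
    return int(math.copysign(magnitude, value)) if magnitude else 0

def fjpeg_encode_coeff(f, q, s_block, s_header):
    """Focused quantization, pre-dequantization (Eq. 2), then standard quantization."""
    r_focused = q * s_block   # spatially varying quantizer for this block
    r_header = q * s_header   # the single quantizer written to the file header
    quantized = round_div(f, r_focused)
    pre_dequantized = quantized * r_focused
    return round_div(pre_dequantized, r_header)

def standard_decode_coeff(stored, q, s_header):
    """What a standard decoder does: multiply by the header quantizer."""
    return stored * q * s_header

# Same coefficient value in an ROI block (s_block = 1) and a peripheral block (s_block = 20).
for s_block in (1, 20):
    stored = fjpeg_encode_coeff(37, q=10, s_block=s_block, s_header=1)
    print(s_block, stored, standard_decode_coeff(stored, q=10, s_header=1))
```

The decoder never sees the per-block value of $S(d_b)$; it only multiplies by the header quantizer, so the focused step shows up solely as additional zeros and coarser values in the stored coefficients.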
The focused pre-dequantization can be computed during the encoding since all information regarding the ROIs and the step function is available. The pre-dequantization returns non-quantized coefficients that can finally be lightly quantized again using the standard Jpeg quantization protocol; many of these coefficients, mostly those outside the ROIs, were already rounded to zero by the previous focused quantization. Coefficients within ROIs are also quantized, but they survive this second and final moderate quantization. Because Jpeg decoders are so widespread, this compatibility allows easy application to web and wireless communications.

2.3. Protocols

A collection of more than one hundred images was used in two main experiments. The images were first compressed by standard Jpeg software to different levels of compression; visual-quality was of course a function of the compression ratio. The same images, along with corresponding preselected ROIs, were also processed by FJpeg for different values of $S(d_b)_{\mathrm{in}}$ and $S(d_b)_{\mathrm{out}}$. During the experiments, subjects (a group of 12, naive with respect to the purpose of the project) were asked, using written instructions, to rate in Experiment-R, or to compare in Experiment-C, compressed images displayed on a computer monitor. In Experiment-R, the subjects were simply asked to look at a picture (displayed in sequence) and then to rate it using a scale displayed on a scroll-bar in a GUI, moved with the computer mouse.
In Experiment-C, the standard and visual-compression images with the same compression ratio and memory occupancy were simultaneously displayed on the screen as a pair of images, side-by-side. Both pictures were progressively updated with a continually decreased compression ratio. This occurred every 3 s with brief half-second gaps in the picture presentation sequence. The subjects' task was to click, using the computer mouse, on the image that they thought was more resolved and easier to view and understand. The click signifying preference had to occur within 3 s of the display of the pair of images. If no click occurred within this 3-s time period the subjects were considered to have expressed no preference. Within less than 30 s the pair of pictures would have progressed to the lowest compression ratio of 0.3 with the highest quality. Thus the subjects had an opportunity to express a preference for one of the two comparison images at six different levels of compression or to express no preference. Of course the pair of pictures, visual-compression and standard compression, were left-right randomly alternated at each 3-s period. The experiment immediately continued with another picture and its sequence of six decreasing compression ratios. This Experiment-C with its comparison protocol was easy for the subjects to learn; they had no difficulties in developing a preference in less than 3 s. The procedure also made for an efficient experiment for large numbers of displayed images. It should be understood that this protocol was designed with future web and wireless applications in mind. Under such real conditions, progressive updating of highly compressed pictures can be accomplished, so that the strongly compressed picture needs to provide adequate visual quality only for a brief period of time. The schema is similar to the standard progressive Jpeg schema except that the progression is biased, in our case, to the ROIs and their surrounding areas.
A progressive form of the visual-compression schema would first (rapidly) transmit the initial and strongly compressed picture and then it would expand, over time, the detail of the transmitted picture toward a final complete one with original bits/pixel. (Prof. Derek Hendry helped with his incisive discussions during the planning of the above validation protocols.)
3. Experimental validation

This main question guided us in designing the two experiments: is FJpeg visually superior to standard Jpeg at the same level of compression? It was clearly necessary to carry out both the rating and the comparison experiments over a wide range of compressions.
3.1. Single presentations and rating judgements

In Experiment-R (Fig. 3), the subjects participating in the experiment were asked to rate the visual quality of single images presented sequentially on the computer screen. The subjects
could express a visual quality rating from 0 to 100 (Fig. 3, ordinate) and enter the value for that image using a horizontal scroll-bar located above the displayed image. This was repeated for the entire set of 180 images, 30 pictures with six levels of compression for each (Fig. 3, abscissa). This collection of variously compressed images was presented in random order, one image at a time, until the subject had adjusted the ratings for all pictures and all compression ratios. For the entire set of 180 images the experiment took about 8 min. Visual-compression was remarkably superior in visual quality rating to standard Jpeg compression for the whole range of compression ratios up to the point, at 0.4, where it was difficult to fault either of the two compression schemes.
Fig. 3. Ratings of visual-quality. Experimental ratings of FJpeg visual-compression images against standard Jpeg ones. Subjective ratings of visual quality (ordinate) of single images plotted as a function of the compression ratio (abscissa). Visual-compression (solid line); standard Jpeg (dashed line). Note the superior visual quality of FJpeg visual-compression as compared with standard Jpeg at the same compression ratios.
3.2. Side-by-side comparisons

In Experiment-C, visual-compression and standard compression images were displayed together for the same level of compression, side-by-side on the screen (Fig. 4). The compressions were at first strong (right-hand side of Fig. 4) and then progressively weaker over time (toward the left-hand side of Fig. 4). The subjects were asked to select the better picture for each presentation during this temporal progression. Visual-compression was always preferred for strong compression levels. For middle levels, visual-compression maintained its superiority through all the levels of compression up to the point where, for minimal compression,
the two methods were similar and a large number of no-preference responses prevailed.

3.3. Control experiments

How many ROIs? In similar evaluation experiments, we studied two different types of images: sparse and not-sparse. The images were analyzed, one by one, by the authors and then classified based on their judgments. Also, a new series of pictures was utilized. Sparse images were without much detail, and were therefore ones for which a single ROI appeared sufficient for visual understanding. Not-sparse images had more complexity and visual detail and thus required more
Fig. 4. Side-by-side comparisons of visual-quality. Standard Jpeg and FJpeg visual-compression images with the same compression ratio (and memory occupancy) were simultaneously displayed on the screen with matched, progressively decreasing, compression ratios. Visual-compression was preferred at all the levels of compression up to the point where, for minimal compression, the two methods were similar and a number of no-preference responses prevailed. Note that the direction of progressive presentation is from right to left in this figure.
ROIs to capture all the distributed visual information. We wanted to verify this intuitive consideration and to identify the optimal number of ROIs in this case. The same set of not-sparse images was compressed both with Jpeg and with FJpeg using 1–4 ROIs and then shown to the subjects according to the rating experimental protocol. A compromise between the number of ROIs and the size of ROIs needed to be set in order to allow the same level of compression and the same total ROI area for different numbers of ROIs. For the cases with one or two ROIs, visual-compression appeared undoubtedly superior to standard compression. Quite surprisingly, for the cases with three and four ROIs, visual-compression appeared less clearly superior to standard compression; thus fewer (and larger) ROIs were preferred to many (and smaller) ROIs.

As a final control experiment we also used random visual-compression. In this case, ROIs were selected randomly over the area of the picture and then the FJpeg encoder was applied. Our results show that random visual-compression performs distinctly worse than visual-compression, even if slightly better than standard compression. How sensitive is this superiority to the correct location of the ROIs? Our finding was that the correct location of ROIs is an important aspect of the method. However, even a random distribution of ROIs seems to be enough to augment perceived image quality over standard Jpeg compression. Thus these random visual-compression results also served to buttress the superiority of visual-compression.

Note that all experiments were repeated several times for each subject. The response of each subject changed approximately 25% of the time from repetition to repetition for the same picture and level of compression. This variance might be due to an effect of familiarization with the viewed images.
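The trade-off between the number and the size of ROIs at constant total ROI area can be made concrete with a short calculation. The sketch assumes square ROIs and a hypothetical 512 × 512 image; neither the ROI shape nor the image size is stated in the text, and the function name `roi_side` is introduced here for illustration only. The quarter-of-the-image area fraction is the one given in Section 2.1.

```python
import math

def roi_side(width, height, n_rois, area_fraction=0.25):
    """Side length in pixels of each square ROI when n_rois share area_fraction of the image."""
    total_roi_area = area_fraction * width * height
    return math.sqrt(total_roi_area / n_rois)

# Hypothetical 512 x 512 image with 1 to 4 ROIs.
for n in range(1, 5):
    print(n, round(roi_side(512, 512, n), 1))
```

Under these assumptions, going from one ROI to four halves the side of each ROI, which gives a sense of how small the individual ROIs become in the three- and four-ROI conditions.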
4. Conclusions

Our visual-compression approach rests on an important property of vision (and of pictures that are designed appropriately to be easily viewed by humans). The visual content of an image can in
part be condensed, by image processing algorithms, IPAs, into a representative small number of regions-of-interest, ROIs, distributed over the image. These ROIs represent much of the information of the image and they are necessary to permit recognition of the image. Normal human vision works in a similar fashion; several eye movement fixations move the fovea (the high-resolution region of the retina, less than two degrees of visual angle) onto specific loci which provide the brain with the necessary information to enable recognition. Our visual-compression method provides the human viewer with several ROIs at a low compression ratio while allowing a high compression ratio for the major expanse of the image. This FJpeg method stands in contrast to usual compression methods like Jpeg that compress an image equally everywhere.

How is it possible to quantify visual quality? Our answer lies in the careful design of the two experiments described in Section 3, wherein a group of subjects were asked to rate or to compare standard Jpeg and FJpeg visual-compression images shown on the computer screen using a dedicated GUI. We generated a collection of compressed images to enable this comparison of FJpeg and Jpeg at different levels of compression and thus of visual quality. One of our experimental design problems was obtaining adequate data in spite of the brain's ability to overcome many defects in an image to obtain satisfactory visual content.

The results of these comparisons document several important aspects of human and computational vision. Our main result was that FJpeg visual-compression was superior to standard Jpeg because of the brain's ability to take advantage of relatively undistorted ROIs in spite of the high compression ratio of the larger expanse of the picture. We studied a wide range of compression ratios and this major finding was evident for strong compression.
Apparently, large areas of the image can be reduced to only one color-consistent DC Jpeg coefficient per block without upsetting the viewer, since the visual brain is, at the same time, busy obtaining information from the ROIs. Even preserving a very few ROIs with the same total picture area appears to be satisfactory.
Alternative solutions have been proposed in the literature (see for example Kundu (1995) and Rosenholtz and Watson (1996)); in these, ROIs are usually identified at the very small block level and severe discontinuities can appear for strong compression even with selective quantization. Also, they may not be easily compatible with standard compression formats. An interesting proposal is discussed by Zhao et al. (1995), where a fuzzy schema is used to determine important regions. Our method is based on the selection of tested IPAs that can identify a small number of compact ROIs whose size and distribution over the image are based on, and strongly inspired by, intensive human eye movement studies. Indeed, as our experiments with subjective visual-quality demonstrated, the biological plausibility of our approach resulted in a clearly better qualitative impression of the FJpeg visual-compression.

The computational format of FJpeg visual-compression sits on top of other methods such as standard Jpeg. Besides standard Jpeg, FJpeg visual-compression is well able to employ Jpeg 2000. The wavelet formulation lends itself very well to differential compression of ROIs and we already have an interesting preliminary implementation of visual-compression based on wavelet decomposition; the results are very promising and we are looking forward to a complete and full integration with the ISO/IEC Jpeg 2000 standard. It can also effectively utilize Mpeg, and especially Mpeg 4 and 7 with their object definitions. With the explosive growth in the use of, and expected developments in, internet wireless devices, a critical need exists for image compression to utilize efficiently the available bandwidth,
especially for very strong compression. Our strong results at a high compression ratio suggest the applicability of FJpeg in this exciting area (Privitera et al., 2001).
References

Kundu, A., 1995. Enhancement of Jpeg coded images by adaptive spatial filtering. In: Proc. Internat. Conf. on Image Process., Vol. 1. IEEE Computer Soc. Press, Washington, DC, pp. 23–26.
Noton, D., Stark, L.W., 1971a. Eye movements and visual perception. Sci. Am. 224, 34–43.
Noton, D., Stark, L.W., 1971b. Scanpaths in eye movements during pattern perception. Science, 308–311.
Pennebaker, W., Mitchell, J., 1993. Jpeg still image data compression standard. Van Nostrand, Princeton, NJ.
Privitera, C.M., Azzariti, M., Stark, L.W., 2000. Locating regions-of-interest for the Mars rover. Internat. J. Remote Sensing 21 (17), 3327–3347.
Privitera, C.M., Stark, L.W., 1999. Focused Jpeg encoding based upon automatic pre-identified regions-of-interest. In: Proc. SPIE, San Jose, CA, Vol. 3644, pp. 552–558.
Privitera, C.M., Stark, L.W., 2000. Algorithms for defining visual regions-of-interest: comparison with eye fixations. IEEE Trans. Pattern Anal. Machine Intell. 22 (9), 970–981.
Privitera, C.M., Stark, L.W., Ho, Y.F., Weinberger, A., Azzariti, M., Siminou, K., 2001. Vision theory guiding web communication. In: Proc. SPIE – Invited Paper, San Jose, CA, Vol. 4311, pp. 53–62.
Rosenholtz, R., Watson, A., 1996. Perceptual adaptive Jpeg coding. In: Proc. Internat. Conf. on Image Process., Vol. 1. IEEE Computer Soc. Press, Lausanne, Switzerland, pp. 901–904.
Zhao, J., Shimazu, Y., Ohta, K., Hayasaka, R., Matsushita, Y., 1995. A Jpeg codec adaptive to the relative importance of regions in an image. In: Proc. Internat. Conf. on Image Process., Vol. 1. IEEE Computer Soc. Press, Washington, DC, pp. 23–26.