Pattern Recognition Letters 19 (1998) 77–88
Using models of feature perception in distortion measure guidance

Xose R. Fdez-Vidal a, J.A. García b, J. Fdez-Valdivia b,*

a Departamento de Física Aplicada, Facultad de Física, Universidad de Santiago de Compostela, 15706 Santiago de Compostela, Spain
b Departamento de Ciencias de la Computación e I.A., E.T.S. de Ingeniería Informática, Universidad de Granada, 18071 Granada, Spain

Received 26 October 1996; revised 10 September 1997
Abstract

In this paper we present three error measures based on feature perception models, in which pixel errors are computed at locations where humans might perceive features in the reference image. In the first part of this work, the three schemes of feature detection are discussed and evaluated in terms of their performance on a simple visual signal-processing task. The first model is based on the use of local intensity gradients, the second on the use of phase congruency in an image, and the third on the use of local energy maxima for a few active sensors under a multichannel organization of the reference picture. In the second part of this paper, examples are provided of object detection and recognition applications that illustrate the ability of the induced error measures to predict the detectability of objects in natural backgrounds as well as their perceptual capabilities. © 1998 Elsevier Science B.V.

Keywords: Feature perception; Error measure; Laplacian zero-crossings; Phase congruency; Energy function
1. Introduction

This work is intended to analyze the relation between two different problems in Computer Vision: the first, what a proper model for identifying significant features in an image is; and the second, what the qualitative differences in performance of distortion measures are when the images to be processed are compared in a selective rather than in a point-by-point way. The natural relation between both problems arises from the fact that a proper selection of significant features in the reference image might
* Corresponding author. E-mail: [email protected].
be used to guide its comparison with another image through any reasonable metric. What motivated this idea is the fact that even though the most common distortion measure is the mean square error (m.s.e.) in intensity, the well-known arguments about its drawbacks and its uncertain relationship with perceived image quality are still valid, and there are enough reports in the literature to confirm this (Budrikis, 1972; Carl, 1987; Karunasekera and Kingsbury, 1995). Although the m.s.e. has a good physical and theoretical basis, this measure is often found to correlate very poorly with subjective ratings, due to the fact that the human visual system does not process the image in a point-by-point manner but rather in a selective way according to decisions made at a cognitive level, by choosing specific data
on which to make judgments and weighting these data more heavily than the rest of the image (Uttal, 1988; Wandell, 1995). To overcome this problem, different weightings have been proposed, for example logarithmic and cube-root ones. Furthermore, both preprocessing the pair of images and emphasizing their edge content have been suggested as well. In this paper, we study a different approach to improving the correlation between subjective rating and the m.s.e., in which the differences between the images to be compared are calculated at locations where humans might perceive features in the reference image, for example line features or step discontinuities. So a visual model for feature perception is used to measure distortion between a distorted image and its original. The actual success of the resulting distortion measure would then depend on both the validity of the vision model and which error metric is used in the perceptual domain. But some experimental results indicate that the error metric is less important than the choice of the visual model. Faugeras (1979) has shown experimentally that images do not change in their relative rankings for three different error metrics which might be applied in the perceptual domain: the maximum absolute deviation, the mean absolute deviation, and the root mean square error (r.m.s.e.) – to make the first two quantities and the mean square error comparable, the square root needs to be taken of the m.s.e.

Of course, the first problem to be solved is that of deriving a model of feature perception capable of successfully explaining a reasonable number of psychophysical effects in human feature perception. Three different options are discussed in Section 2: one based on the use of local intensity gradients, another based on the use of maximum phase congruency in an image, and the third based on the use of local energy maxima for a few active sensors under a multichannel organization of the reference picture. In Section 3, we proceed to reformulate the r.m.s.e. to process images at locations of features as extracted by each of the three models. To examine the effectiveness of the distortion measures proposed here, in Section 4 their performance is evaluated for a number of test patterns, showing the differences with the standard root mean square error and with
subjective rankings. There are several characteristics to look for when choosing a distortion measure: the measure should be zero when the reference and input image are identical, never be negative, and be monotonically increasing as the input image "looks worse". The measure derived by using each model of feature perception will satisfy the first two criteria. So the interesting point is to present data regarding the behavior of the measures with respect to the third criterion. In fact, as noted above, the rmse says much about the quantity of per-pixel error, but nothing about the quality of the distortion. So the question is: What is the status of the derived measures as distortion criteria? In other words: Do the devised mathematical measures capture any of the visual impressions of a human observer? Section 5 presents an answer to this question.
2. Computational models for feature detection

The human visual system allocates different amounts of processing resources to different portions of the visual field, which provides a trade-off between resources and time. On the one hand, attention can be shifted to a new location through a saccadic eye movement. On the other hand, the photoreceptor density that decreases between the fovea and the periphery induces non-uniform processing capability over the entire field. In fact, the conclusion is still more surprising: features will only be perceived if they succeed in attracting attention (Rock, 1995). The important point is then what kinds of features in an image seem to draw the subject's attention and thus become conscious, and a great deal of biological vision research has addressed such a problem.

This section is divided into three parts: the first describes a model for identifying features based on the encoding of the extrema in the blurred spatial derivatives of the image (Marr and Hildreth, 1980); the second, the local energy model of feature detection (Morrone and Owens, 1987; Morrone and Burr, 1988; Kovesi, 1995); and the third presents a method for identifying features based on the local maxima of energy maps for a few active sensors for a multichannel partition of the reference image (Fdez-Valdivia et al., 1996). In the latter part of this section, an example is
given of a visual task that the two energy-based models can perform but that cannot be performed under a differential scheme.

2.1. Image features from Laplacian zero-crossings

Marr (1982) provides evidence suggesting that the retinal operation on an image might be described analytically as the convolution of the 2D image with the Laplacian-of-Gaussian (LoG) operator, ∇²G_σ, of various sizes σ. He also concluded that surface markings or object edges stand in a close relation to scene entities and so they are important cues in image interpretation. In conclusion, according to this derivative-based model of early image representation, the most significant features are the zero-crossings of the multiscale LoG, that is, the points (x, y) such that

$$\nabla^2 G_\sigma(x, y) * I(x, y) = 0, \qquad (1)$$

with G_σ being the 2D Gaussian smoothing operator that suppresses the influence of pixels further than 3σ from the current pixel, and with ∇² being the Laplacian, an isotropic second-derivative operator. The resultant zero-crossings are used to detect and localize significant features such as edges or occlusion boundaries.
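For illustration, Eq. (1) can be sketched in a few lines of Python. The use of scipy's gaussian_laplace and the sign-change test below are our own choices, not part of the original formulation; truncate=3.0 mirrors the 3σ support mentioned above.

```python
import numpy as np
from scipy import ndimage

def log_zero_crossings(image, sigma):
    """Zero-crossings of the Laplacian-of-Gaussian at scale sigma (Eq. (1))."""
    # truncate=3.0 matches the text's suppression of pixels beyond 3*sigma.
    log = ndimage.gaussian_laplace(image.astype(float), sigma=sigma, truncate=3.0)
    # Mark a pixel when its LoG response changes sign against a horizontal
    # or vertical neighbour.
    zc = np.zeros(log.shape, dtype=bool)
    zc[:-1, :] |= np.signbit(log[:-1, :]) != np.signbit(log[1:, :])
    zc[:, :-1] |= np.signbit(log[:, :-1]) != np.signbit(log[:, 1:])
    return zc
```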
2.2. Image features from phase congruency

Developing further the concept of specialized detectors for both major types of image features, lines and edges, Morrone and Owens (1987) and Morrone and Burr (1988) proposed a local-energy model of feature detection. This model postulates that features are perceived at points in an image where the Fourier components are maximally in phase, and it successfully explains a number of psychophysical effects in human feature perception (Morrone and Burr, 1988). It is interesting to note that this model predicts the conditions under which Mach bands appear, and the contrast necessary to see them. To detect the points of phase congruency, an energy function is defined. The energy of an image may be extracted by using the standard method of squaring the outputs of two filters that are in quadrature phase (90° out of phase) (Gabor, 1946). Features, both lines and edges, are then signaled by peaks in local energy functions. In fact, energy is locally maximum where the harmonic components of the stimulus come into phase – see (Morrone and Burr, 1988) for a proof.

The implementation of the local-energy model used here is that presented in (Kovesi, 1995), and so the calculation of phase congruency in 2D images is performed using Gabor wavelets. To detect features at all orientations, the bank of filters is designed so that they tile the frequency plane uniformly. In the frequency plane the filters appear as 2D Gaussians symmetrically or anti-symmetrically placed around the origin, depending on the spatial symmetry of the filters. The length-to-width ratio of the 2D wavelets controls their directional selectivity, and this ratio may be varied in conjunction with the number of filter orientations used in order to achieve an even coverage of the 2D spectrum.
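The quadrature-pair energy computation is easy to demonstrate in one dimension. The sketch below is our own illustration (the filter frequency and envelope width are arbitrary values, not the tuned wavelet parameters of (Kovesi, 1995)); its energy profile peaks at a step edge.

```python
import numpy as np

def local_energy_1d(signal, freq=0.1, sigma=8.0):
    """Local energy from a quadrature pair of 1D Gabor filters."""
    t = np.arange(-4 * sigma, 4 * sigma + 1)
    gauss = np.exp(-t**2 / (2 * sigma**2))
    even = gauss * np.cos(2 * np.pi * freq * t)   # even-symmetric (cosine) filter
    odd = gauss * np.sin(2 * np.pi * freq * t)    # odd-symmetric (sine) filter
    even -= even.mean()                           # remove the DC component
    e = np.convolve(signal, even, mode='same')
    o = np.convolve(signal, odd, mode='same')
    return e**2 + o**2                            # quadrature energy

step = np.r_[np.zeros(100), np.ones(100)]         # a step edge at sample 100
energy = local_energy_1d(step)
print(int(np.argmax(energy)))                     # peaks near the step location
```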
2.3. Image features from active sensors

This third approach extracts features from the viewpoint of a few active sensors selective to both orientation and spatial frequency, under a multichannel organization (Fdez-Valdivia et al., 1996). The spatial locations worth noting regarding the image representation may be the local maxima of the energy map, for each strongly responding unit selective to 2D spatial frequency.

The first problem to be solved is the selection of an appropriate set of sensors, which is a central issue in multi-channel approaches. On the 2D spatial-frequency plane, the superposition of a number of data-driven spatial-frequency channels on a set of orientation channels produces the desired organization. Following the efficiency of the model human image code given by a number of authors, as well as the biological evidence that the median orientation bandwidth of visual cortex cells is about 40 deg, the orientation channels are selected with orientations of 0°, 45°, 90° and 135°, namely C0, C1, C2 and C3, which correspond to an orientation bandwidth of 45 deg. To derive the sensors, each of the four orientation channels Ci, i = 0, 1, 2, 3, with an orientation of i × 45 deg, respectively, needs to be partitioned into a number of spatial-frequency bands. This is accomplished by using an index for the orientation channel Ci, denoted h_Ci(r_sup), that indicates the relative amount of spectrum folded back into the 2D Fourier domain, given by the spatial-frequency band (0, r_sup) within that orientation channel, Ci, after the sampling rate reduction to r_sup – see (Fdez-Valdivia et al., 1996) for further details.

Once the partition of the 2D Fourier plane into a number of sensors has been carried out, the significance of the sensor response is analyzed by classifying sensors into two classes: the active sensors and the non-active ones. The only sensors worth noting regarding feature detection are those that exhibit a strong response to the significant structures in the image, namely the active sensors. Each sensor should be described by a sensor measure that can successfully characterize it. Fdez-Valdivia et al. (1996) propose one feature derived from the summation of the normalized 2D power spectrum over the sensor. Normalization is a non-linear operation, in which each sensor's 2D power spectrum – the sensor response to the stimulus – is divided by the total power spectrum of the image. The effect of normalization is that the response of each sensor is rescaled with respect to the pooled activity of all the sensors. According to this model, a given sensor might be suppressed by the other sensors, including those with perpendicular orientation tunings. In the formulation proposed in (Fdez-Valdivia et al., 1996), cluster analysis is then used to group sensors together, since unsupervised learning may exploit the statistical regularities of the sensors by using the available sensor responses.

Finally, the locations of the local energy map features, for each active sensor, are defined as reasonable candidates for locations where the visual system perceives something of interest. For each active sensor Ch_i, let J_Chi(x, y) be the image filtered by a complex 2D Gabor filter, as given by

$$J_{Ch_i} = r * g(\sigma_{Ch_i}, \rho_{Ch_i}, \theta_{Ch_i}), \qquad (2)$$

with the filter parameters set to the global scale of Ch_i, σ_Chi, its central spatial frequency ρ_Chi, and its central orientation θ_Chi – see (Fdez-Valdivia et al., 1996) for further details about how these parameters can be obtained. Then the local energy map from the viewpoint of the sensor Ch_i, denoted Lem_Chi(x, y), is defined by

$$\mathrm{Lem}_{Ch_i}(x, y) = |J_{Ch_i}(x, y)|^2. \qquad (3)$$

To model as closely as possible the known properties of the human visual system, the local energy map is assumed to be calculated separately over the active sensors of the derived organization with the respective complex Gabor filter g(σ_Chi, ρ_Chi, θ_Chi), at the sensor parameters σ_Chi, ρ_Chi and θ_Chi. This is equivalent to saying that the original image r(x, y) is filtered by a quadrature pair having the same amplitude spectra, by using the sine (odd-symmetric) and cosine (even-symmetric) versions of the same complex filter g(σ_Chi, ρ_Chi, θ_Chi):

$$J^{even}_{Ch_i} = r * g^{even}(\sigma_{Ch_i}, \rho_{Ch_i}, \theta_{Ch_i}) \qquad (4)$$

and

$$J^{odd}_{Ch_i} = r * g^{odd}(\sigma_{Ch_i}, \rho_{Ch_i}, \theta_{Ch_i}), \qquad (5)$$
where * denotes the convolution operator, and with g^even and g^odd being the even-symmetric and odd-symmetric Gabor filters. In fact, to calculate the Lem_Chi map, the outputs of the two filters that make up the pair need to be squared and summed:

$$\mathrm{Lem}_{Ch_i}(x, y) = \left[J^{even}_{Ch_i}(x, y)\right]^2 + \left[J^{odd}_{Ch_i}(x, y)\right]^2. \qquad (6)$$

Consequently, the local energy map Lem_Chi provides us with an image representation in the space spanned by the two functions, J^even_Chi and J^odd_Chi. Hence, the detection of peaks on the Lem_Chi map acts as a detector of the locations of significant features from the viewpoint of Ch_i.
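A compact sketch of Eqs. (2)–(6) for a single sensor follows. The parameters sigma, freq and theta stand for σ_Chi, ρ_Chi and θ_Chi but are supplied by hand here, whereas in (Fdez-Valdivia et al., 1996) they are derived from the active-sensor analysis; the isotropic Gaussian envelope is our simplification of the complex Gabor filter used there.

```python
import numpy as np
from scipy import signal

def gabor_energy_map(image, sigma, freq, theta):
    """Local energy map Lem of Eq. (6) for one sensor, built from the
    quadrature pair of Eqs. (4) and (5)."""
    half = int(3 * sigma)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate to the sensor orientation
    gauss = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    g_even = gauss * np.cos(2 * np.pi * freq * xr)  # cosine (even-symmetric) filter
    g_odd = gauss * np.sin(2 * np.pi * freq * xr)   # sine (odd-symmetric) filter
    j_even = signal.fftconvolve(image, g_even, mode='same')
    j_odd = signal.fftconvolve(image, g_odd, mode='same')
    return j_even**2 + j_odd**2                   # squared modulus |J|^2
```

The peaks of the returned map are then the candidate feature locations from the viewpoint of that sensor.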
Fig. 1. Four examples of the texture family. All have the same value of u_0. The image in (a) has m = 0. (b)–(d) show images having values of v_0 equal to u_0/2, u_0 or 2u_0.

Fig. 2. Laplacian zero-crossings for the four images in Fig. 1.

2.4. Texture discrimination

The purpose of this section is to show a simple information-processing operation that, while effortless in human pattern vision, can be shown to be impossible under the Laplacian zero-crossings scheme but possible under the latter two energy-based schemes. In fact, Daugman (1988) presented a particular texture discrimination task that the human visual system can perform effortlessly and showed that this task would be impossible if the Laplacian zero-crossings model described above were a correct description of early human pattern vision. Let us consider a family of images in the space domain described by

$$I(x, y) = L_0 + \cos(u_0 x) + \frac{m}{2}\left[\cos(u_0 x + v_0 y) + \cos(u_0 x - v_0 y)\right], \qquad (7)$$

where

$$|m| < \frac{u_0^2}{u_0^2 + v_0^2},$$

and with L_0 being the mean luminance (greater than 2); u_0 and v_0 arbitrary. Fig. 1 shows four examples of this family. All have the same value of u_0. The image in Fig. 1(a) has m = 0. Fig. 1(b–d) shows images having values of v_0 equal to u_0/2, u_0 or 2u_0, corresponding to oriented sidebands whose frequencies are 1.11, 1.41 and 2.23 times higher than the carrier frequency. In fact, they are readily distinguishable to a human observer.

The point is that, as shown in (Daugman, 1988), when such textures are convolved with Laplacian-of-Gaussian (LoG) operators ∇²G_σ of various scales σ, they all emerge with identical zero-crossings and thus could not be distinguished perceptually on that basis. The only zero-crossings are those of the cos(u_0 x) term (see Fig. 2). But each image I(x, y) is the superposition of three cosine waves: cos(u_0 x), cos(u_0 x + v_0 y) and cos(u_0 x − v_0 y). On the one hand, the spatial frequency of cos(u_0 x) is u_0, and that of the latter cosine terms is √(u_0² + v_0²). On the other hand, the first
cosine term has 0 deg Fourier phase, and the latter two terms have arctan(v_0/u_0) and arctan(−v_0/u_0) Fourier phase, respectively. That is, the phase of the latter two cosine terms depends on both vector frequency components u_0 and v_0. Hence, when the value of v_0 changes to produce different examples of the texture family, the points at which the cosine waves meet also change, and consequently the arrival phases of the cosine components will be most similar at different locations in the image. Because the prominent visual features of I(x, y) are found at points of greatest phase agreement, and each I(x, y) has different points of maximum phase congruency (locations at which the corresponding cosine components meet), the images have different features, which makes them distinguishable to a phase-congruency scheme.

Fig. 3. (a), (b), (c) The points of maximum phase congruency for the images in Fig. 1(b,c,d), respectively. They are computed as described in Section 2.2.

Fig. 3(a) (respectively, Fig. 3(b,c)) shows the features for the image in Fig. 1(b) (respectively, Fig. 1(c,d)), as detected using the energy model described in Section 2.2 (recall that the local maxima of the energy function occur at points of maximum phase congruency). On the contrary, the local energy profile for the single sinusoid in Fig. 1(a) is flat, and so no peak occurs in this waveform (as shown in (Owens, 1994), this is an example of a feature-free image). Summarizing, different images have distinct features.

Fig. 4. (a) The features detected from the only active sensor for the image in Fig. 1(a); (b–c), (d–e) and (f–g) the features detected from the two active sensors for the images in Fig. 1(b), 1(c) and 1(d), respectively.

Fig. 4 shows features computed as described in Section 2.3. More precisely, Fig. 4(a) presents the features from the only active sensor for the image in Fig. 1(a). Each of the images in Fig. 1(b,c,d) has two active sensors, and the corresponding features are shown in Fig. 4(b,c), Fig. 4(d,e) and Fig. 4(f,g), respectively. For different images, this model of feature detection produces distinct features.

The problem just described is only an example, but it shows one of the most important constraints in early vision for representing image structures: the physical processes underlying image formation are typically smooth. And while locations of sharp changes in image intensity are commonly localized using Laplacian zero-crossings, for images without sharp changes in intensity, zero-crossings in the Laplacian are missing (for example, think of an orthographically projected image of a sphere with a Lambertian reflection function and parallel illumination; or the superposition of a sine-wave grating onto a paraboloid (Daugman, 1988)). On the contrary, a phase-congruency model is not based on the use of local intensity gradients for feature detection; instead, features are perceived at points in an image where the Fourier components are maximally in phase. These points are locations at which the local energy peaks, and while the local energy profile is flat only
for single sinusoids, peaks occur in all other waveforms – see (Owens, 1994) for further details about feature-free images.
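Daugman's observation is easy to reproduce numerically. In the sketch below, the parameter values are our own illustrative choices, selected to respect the bound of Eq. (7); the two textures differ visibly, yet their LoG sign maps, and hence their zero-crossing sets, coincide.

```python
import numpy as np
from scipy import ndimage

def texture(u0, v0, m, L0=2.5, size=128):
    """One member of the texture family of Eq. (7)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return (L0 + np.cos(u0 * x)
            + 0.5 * m * (np.cos(u0 * x + v0 * y) + np.cos(u0 * x - v0 * y)))

u0 = 2 * np.pi / 16                       # illustrative carrier frequency
m = 0.5 * u0**2 / (u0**2 + (2 * u0)**2)   # safely below the bound of Eq. (7)
a = texture(u0, u0 / 2, m)                # v0 = u0/2, as in Fig. 1(b)
b = texture(u0, 2 * u0, m)                # v0 = 2*u0, as in Fig. 1(d)
# The LoG responses of both textures change sign only where cos(u0*x) does,
# so their zero-crossing maps coincide although the textures look different.
sa = np.signbit(ndimage.gaussian_laplace(a, 2.0))
sb = np.signbit(ndimage.gaussian_laplace(b, 2.0))
print(np.mean(sa == sb))                  # prints a value close to 1.0
```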
3. Error measure guidance

All common error measures quantify the error present in an input image by making use of the residue, that is, the input image subtracted from the original. In this way, the mean square error (mse) averages the squares of the pixel differences:

$$\mathrm{mse} = \frac{1}{NM}\sum_{x=1}^{N}\sum_{y=1}^{M}\left[O(x, y) - I(x, y)\right]^2, \qquad (8)$$

with O and I being the reference image and the input image, which are N × M pixels in size. Taking the square root is one way of reducing the range of values, in which case Eq. (8) becomes

$$\mathrm{rmse} = \sqrt{\frac{1}{NM}\sum_{x=1}^{N}\sum_{y=1}^{M}\left[O(x, y) - I(x, y)\right]^2}. \qquad (9)$$

In the rmse, large pixel errors have a greater contribution to the error. On the other hand, the images are compared in a systematic way, pixel by pixel. This error measure, designed to answer "Are these two images different?", should also predict answers to the question "Is there an object in this image?". But object detection involves looking for one of a large set of object sub-images in a large set of background images. To accomplish this task it therefore seems necessary to make use of features, which stand in close relation to image objects, to compute the disparity in image intensity between corresponding image points. Obviously, this task cannot be performed using a systematic point-by-point approach such as the rmse postulates.

To overcome this problem, we propose that, in order to devise alternative measures that better capture the response of the human visual system, a feature detection model should be used to identify significant locations at which to measure differences between corresponding image pixels. In this way, each of the feature perception models described above induces an error measure.
The first error measure proposed attempts to improve on the above equations by computing pixel errors on neighborhoods of Laplacian zero-crossings for the reference image O. To be precise, it is calculated as

$$\mathrm{rmse}_{ZC} = \sqrt{\frac{1}{\mathrm{Card}[W_{zc}(O)]}\sum_{(x, y)\in W_{zc}(O)}\left[O(x, y) - I(x, y)\right]^2}, \qquad (10)$$

where W_zc(O) is defined as

W_zc(O) = {(x, y) ∈ W(i, j) | (i, j) is a Laplacian zero-crossing for O; W(i, j) is a neighborhood of (i, j)}.

We compute pixel errors on neighborhoods of zero-crossings, rather than at the locations of the zero-crossings alone, taking into account that the photoreceptor density in the human retina decreases between the fovea and the periphery. In the following, the neighborhood W(i, j) will be defined and analyzed.

The second error measure proposed computes pixel errors on neighborhoods of image features from phase congruency for the reference image O:

$$\mathrm{rmse}_{PC} = \sqrt{\frac{1}{\mathrm{Card}[W_{pc}(O)]}\sum_{(x, y)\in W_{pc}(O)}\left[O(x, y) - I(x, y)\right]^2}, \qquad (11)$$

where W_pc(O) is defined as

W_pc(O) = {(x, y) ∈ W(i, j) | (i, j) is a point of maximum phase congruency for O; W(i, j) is a neighborhood of (i, j)}.

The neighborhood W(i, j) in Eq. (10) and Eq. (11) was defined as the pixels contained in a digitized disk of radius r centered at (i, j).
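Both rmse_ZC and rmse_PC are instances of one computation: an rmse restricted to the union of digitized disks W(i, j) around feature points. A minimal sketch, assuming a boolean feature map produced by whichever detector is chosen (Laplacian zero-crossings for rmse_ZC, phase-congruency maxima for rmse_PC):

```python
import numpy as np
from scipy import ndimage

def guided_rmse(reference, distorted, feature_mask, r):
    """rmse over disk neighbourhoods of feature points (Eqs. (10) and (11))."""
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    disk = (xx**2 + yy**2) <= r**2
    # Dilating the feature map by the disk yields the union of the W(i, j).
    w = ndimage.binary_dilation(feature_mask, structure=disk)
    diff = reference.astype(float) - distorted.astype(float)
    return np.sqrt(np.mean(diff[w] ** 2))   # averaged over Card[W(O)] pixels
```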
In the next section, the stability of the resultant error measures under different values of the disk radius r will be analyzed.

Finally, we propose a measure in which the pixel errors are computed on neighborhoods of local energy maxima from active sensors for the reference image O. This error measure is defined as

$$\mathrm{rmse}_{AS} = \max\{\mathrm{rmse}_{Ch_i} \mid Ch_i \in \mathrm{Active}\}, \qquad (12)$$

with Active being the subset of active sensors for the reference image O, and where rmse_Chi measures the distortion from the viewpoint of the active sensor Ch_i as

$$\mathrm{rmse}_{Ch_i} = \sqrt{\frac{1}{\mathrm{Card}[W_{Ch_i}(O)]}\sum_{(x, y)\in W_{Ch_i}(O)}\left[O(x, y) - I(x, y)\right]^2}, \qquad (13)$$

where W_Chi(O) is defined as

W_Chi(O) = {(x, y) ∈ W(i, j) | (i, j) is a feature from active sensor Ch_i; W(i, j) is a neighborhood of (i, j)}.

In this equation, the neighborhood W(i, j) is defined as the pixels contained in a disk of radius r centered at (i, j), with the radius r being r = d[(i_m, j_m); (i, j)], where (i_m, j_m) is the nearest local minimum to (i, j) on the energy map Lem_Chi defined as given in Eq. (6), and with d[·,·] being the Euclidean distance. The nearest local minimum to (i, j) on the local energy map marks the beginning of another potential structure.
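With Eq. (13) computed per sensor in the same neighborhood-restricted way, Eq. (12) reduces to a worst-case maximum. In the sketch below, the per-sensor masks W_Chi(O), including the adaptive radius just described, are assumed to have been built elsewhere:

```python
import numpy as np

def rmse_as(reference, distorted, sensor_masks):
    """rmse_AS of Eq. (12): the maximum rmse_Chi over the active sensors.

    sensor_masks maps each active sensor to the boolean map of W_Chi(O)
    from Eq. (13); building those masks is assumed done elsewhere."""
    diff = reference.astype(float) - distorted.astype(float)
    return max(np.sqrt(np.mean(diff[w] ** 2)) for w in sensor_masks.values())
```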
Fig. 5. The six image pairs used in the first experiment. Each pair shows an object image and a non-object image.
Table 1
Experimental comparison of the six image pairs in Fig. 5. Column 1: subjective ranking and clusters. Column 2: rmse error. Columns 3, 4 and 5: image distortion by the derived error measures. Distinctness of targets and their immediate surround.

Image pair   Subjective ranking   rmse           rmse_ZC        rmse_PC        rmse_AS
             Cluster   Order      Value   Rank   Value   Rank   Value   Rank   Value   Rank
a1           1         1          6.626   4      6.762   4      16.468  1      12.508  1
a3           1         2          7.096   1      7.298   1      10.422  3      9.536   3
a4           1         3          6.748   2      7.090   2      11.106  2      10.016  2
a2           2         4          6.680   3      6.907   3      10.166  4      7.642   4
a5           3         5          5.611   5      5.964   5      8.302   5      0.000   6
a6           3         6          5.350   6      5.756   6      7.341   6      4.535   5

4. Experimental results

To place the three distortion measures under scrutiny, they have been tested extensively in object
detection and recognition applications. In the following, we present our results in two parts.

4.1. Distinctness of targets and their immediate surrounds

The images in this first experiment were six grey-level images of natural scenes containing a single vehicle located in the middle of the image, and the six corresponding scenes without the vehicle. The vehicles are of different degrees of visibility. All the images are 128 × 128 pixels in size, and they are shown in Fig. 5. In this first experiment we are interested in the relative distinctness of the targets and their immediate surrounds computed with the three proposed measures.

The rank order based on the area under the cumulative detection curve for ten human subjects is listed in Table 1. There are indeed 3 clusters in Fig. 5 ("cluster" means a group of closely spaced, largely overlapping, cumulative distribution curves): the first, {Image-Pair a1, Image-Pair a3, Image-Pair a4}; the second, {Image-Pair a2}; the third, {Image-Pair a5, Image-Pair a6}. Permutations within clusters are not serious since the corresponding cumulative detection curves cross each other. The rank orders by the three proposed error measures as well as the standard rmse are also shown in Table 1. Summarizing, it seems that whereas the rmse and the rmse_ZC give a poor measure of distinctness quality, the measures rmse_PC and rmse_AS are good predictors of target saliency for humans performing visual search and detection tasks.

The neighborhood W(i, j) in Eq. (10) and Eq. (11) was defined as the pixels contained in a digitized disk of radius r centered at (i, j). For values of r set to 0, 1, 2, 3, 4, the plots in Fig. 6 illustrate the image-pair discriminability measured by the resultant rmse_PC. The rmse_PC error is shown on the vertical axis, and the six different image pairs on the horizontal. For values of r greater than 2, the resultant rankings give the same good correlation with the subjective ranking.

Fig. 6. Plots that illustrate the image-pair discriminability measured by the resultant rmse_PC for values of r set to 0, 1, 2, 3, 4. The rmse_PC error is shown on the vertical axis, and the six different image pairs on the horizontal.

4.2. Detectability of a basic pattern

In this second experiment, six degraded images were used. First, in order to produce three of these images, the 256 × 256 "Einstein" image in Fig. 7 was divided into boxes of 8 × 8, 12 × 12 and 16 × 16 pixels, respectively, and each box was printed in the corresponding image with the average intensity found in that box. As can be seen in Fig. 7, this has the effect of removing high frequencies; any detail smaller than the width of a box must be averaged out. It also has the effect of adding some high frequencies, namely those that define the edges of the boxes. Second, the other three degraded images were generated by filtering the original image in Fig. 7 with a Gaussian at values of the smoothing parameter set to 4.9, 6.7 and 8.2, respectively.
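The two degradations are straightforward to reproduce. A minimal sketch follows; the function names are ours, and only the box sizes and smoothing values come from the text:

```python
import numpy as np
from scipy import ndimage

def box_degrade(image, p):
    """Replace each p x p box by its average intensity (the boxed images)."""
    h, w = image.shape
    out = image.astype(float).copy()
    for i in range(0, h, p):
        for j in range(0, w, p):
            out[i:i + p, j:j + p] = out[i:i + p, j:j + p].mean()
    return out

def blur_degrade(image, sigma):
    """Gaussian smoothing at parameter sigma (4.9, 6.7 or 8.2 in the text)."""
    return ndimage.gaussian_filter(image.astype(float), sigma)
```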
Fig. 7. The left column shows the result of the computer processing in which all the picture frequencies higher than the repeat frequencies of the boxes (8 × 8, 12 × 12 and 16 × 16, respectively) have been removed; the right column shows the other three degraded images in the second experiment, generated by filtering the original image with a Gaussian at values of the smoothing parameter set to 4.9, 6.7 and 8.2.
There is no way to recover the lost high frequencies of the reference image. In any case, for the 8 × 8 boxed picture, we can identify the stimulus since we can remove the additional high frequencies that made the picture information hard to see. Hence, the 8 × 8 boxed image is not great, but still recognizable. On the contrary, for the 16 × 16 boxed picture, because there is too much masking, we cannot remove enough of the masking frequencies (the edges of the boxes), and this image is not recognizable.

The degraded images were shown to twenty human subjects who ranked them based on the level of detectability of the basic pattern. Table 2 shows the resultant subjective ranking of the images.

Table 2
Experimental comparison of the six degraded images in Fig. 7 with the original Einstein image. Column 1: subjective ranking and clusters. Column 2: rmse error. Columns 3, 4 and 5: image distortion by the derived measures. Detectability of a basic pattern.

Image         Subjective ranking   rmse           rmse_ZC        rmse_PC        rmse_AS
              Cluster   Order      Value   Rank   Value   Rank   Value   Rank   Value   Rank
σ = 4.9       1         1          18.572  2      19.180  2      36.029  1      22.690  1
p = 8 × 8     1         2          17.711  1      19.062  1      36.780  2      23.323  2
σ = 6.7       2         3          20.714  4      21.197  4      38.883  3      24.831  3
σ = 8.2       2         4          22.081  6      22.409  5      40.379  5      26.189  5
p = 12 × 12   2         5          20.044  3      21.122  3      39.202  4      24.962  4
p = 16 × 16   3         6          21.395  5      23.094  6      41.999  6      26.687  6
As noted above, the neighborhood W(i, j) in Eq. (10) and Eq. (11) was defined as the pixels contained in a digitized disk of radius r centered at (i, j). Fig. 8 shows the resultant rmse_PC for values of the parameter r set to 0, 1, 2, 3, 4. The rmse_PC error is shown on the vertical axis, and the six degraded images on the horizontal. For values of r greater than 0, the resultant rankings are good predictors of target saliency for humans performing visual detection tasks. From Table 2, it can be seen that even though the six degraded images have approximately the same rmse, the subjective ranking correlates well with the perceptual distortion measured by rmse_PC and rmse_AS, whereas for the rmse and rmse_ZC it does not.

Fig. 8. Plots that illustrate the resultant rmse_PC for values of the parameter r set to 0, 1, 2, 3, 4. The rmse_PC error is shown on the vertical axis, and the six degraded images on the horizontal.
5. Conclusions

The main conclusions of this work are twofold: first, simple visual signal-processing tasks that phase-congruency schemes can perform effortlessly cannot be performed using a Laplacian zero-crossings scheme; second, a phase-congruency model of feature detection induces an error measure in the corresponding perceptual domain that improves the correlation between subjective rating and the original pixel-by-pixel error metric, and consequently better captures the response of the human visual system.
Acknowledgements

This paper was prepared while Xose R. Fdez-Vidal was on leave from the Department of Applied Physics at Santiago University, sponsored by the Xunta de Galicia, visiting the Department of Computer Science and A.I. at Granada University. The authors would like to thank Javier Martinez-Baena, Peter Kovesi and John G. Daugman for their help and comments, and Lex Toet for providing the target images in Fig. 5 as well as the corresponding subjective ranking. The authors also thank the referees for their interest in improving the quality of this work. This research was sponsored by the Spanish Board for Science and Technology (CICYT) under grant TIC97-1150.
References

Budrikis, Z.L., 1972. Visual fidelity criterion and modelling. IEEE Proc. 60, 771–779.
Carl, J.W., 1987. Quantitative fidelity criterion for image processing applications. SPIE Proc. 858, 2–8.
Daugman, J.G., 1988. Pattern and motion vision without Laplacian zero crossings. J. Opt. Soc. Am. Ser. A 5 (7), 1142–1148.
Faugeras, O.D., 1979. Digital color image processing within the framework of a human visual model. IEEE Trans. Acoust. Speech Signal Process. 27, 380–393.
Fdez-Valdivia, J., Garcia, J.A., Martinez-Baena, J., 1996. The selection of natural scales for images using adaptive Gabor filtering. Technical Report DECSAI-960329, Computer Science Dept. Electronic edition: ftp://decsai.ugr.es/diata/tech_rep/TR960329.ps.Z.
Gabor, D., 1946. Theory of communication. J. Inst. Electr. Engrg. 93, 429–457.
Karunasekera, S.A., Kingsbury, N.G., 1995. A distortion measure for blocking artifacts in images based on human visual sensitivity. IEEE Trans. on Image Process. 4, 713–724.
Kovesi, P., 1995. Image features from phase congruency. Technical Report 95/4, June 1995, Dept. of Computer Science, The University of Western Australia.
Marr, D., 1982. Vision. Freeman, San Francisco.
Marr, D., Hildreth, E., 1980. Theory of edge detection. Proc. Roy. Soc. London Ser. B 207, 187–217.
Morrone, M.C., Owens, R.A., 1987. Feature detection from local energy. Pattern Recognition Letters 6, 303–313.
Morrone, M.C., Burr, D.C., 1988. Feature detection in human vision: A phase-dependent energy model. Proc. Roy. Soc. London Ser. B 235, 221–245.
Owens, R.A., 1994. Feature-free images. Pattern Recognition Letters 15 (1), 35–44.
Rock, I., 1995. Perception. Scientific American Books, Inc., New York.
Uttal, W.R., 1988. On Seeing Forms. Lawrence Erlbaum, Hillsdale, NJ.
Wandell, B.A., 1995. Foundations of Vision. Sinauer Associates Inc. Publishers, Sunderland, MA.