A modified metric to compute distance

Pattern Recognition, Vol. 25, No. 7, pp. 667-677, 1992. Printed in Great Britain. 0031-3203/92 $5.00 + .00. Pergamon Press Ltd. © 1992 Pattern Recognition Society

D. CHAUDHURI, C. A. MURTHY and B. B. CHAUDHURI†
Electronics and Communication Sciences Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Calcutta 700035, India

(Received 31 August 1990; in revised form 18 June 1991; received for publication 13 November 1991)

Abstract--Euclidean distance is used in many practical problems. This paper proposes a new metric which is close to the Euclidean distance and also computationally more efficient. This metric is helpful when the dimension of the data set is large. Bounds of a measure of merit of the new metric as well as of the City-block and Chessboard metrics with respect to the Euclidean metric are analytically established. The utility of this metric is shown on a randomly generated data set in the context of clustering.

Euclidean distance; City-block distance; Chessboard distance; Minimal spanning tree; Clustering; Image processing; Pattern recognition

1. INTRODUCTION

Distance is a very important concept used widely in applied science problems such as pattern recognition and image processing.(1-5) It is desirable that the distance is a metric. Three special cases of the $L_p$ (Minkowski) metric, namely City-block distance $d_C$, Euclidean distance $d_E$ and Chessboard distance $d_M$, are popular. For two n-dimensional points, $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, the $L_p$ metric is defined as

$$d_p(X, Y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$

where $d_C$, $d_E$ and $d_M$ correspond to $p = 1$, $2$ and $\infty$, respectively. Since the conventional data space is Euclidean, it is natural to use Euclidean distance in such a space. But computation of Euclidean distance is expensive in a high dimensional space, especially when such computation is to be performed on a large amount of data. An example is the processing of multispectral imagery, which contains more than a million pixels per image frame and where many such frames are to be processed. In statistical pattern recognition(13) methods also, the distance is to be computed iteratively on a large amount of data. In order to reduce the computation, City-block and Chessboard distances are often used in image processing(12) and related problems. While these distances are computationally more efficient, they deviate markedly from the Euclidean framework and cause the accuracy of the final result to suffer. It is, therefore, desirable to find a distance function that is as close to Euclidean as possible and yet requires much less computation effort. Also, it is desirable that this new distance is a metric. The purpose of the present paper is to propose such a metric.

While various other metrics are defined(6) and distance functions on a digital grid are widely investigated,(7-9) no work similar to the present study is known to the present authors. The advantage of the metric proposed here is that its applicability is not restricted to digital space. It is a simple combination of City-block and Chessboard distances requiring very little computation effort.

The new distance $d_N$ is proposed and its metricity is established in Section 2. Various properties of $d_N$ including its upper bound with respect to Euclidean distance are established in Section 3. An application of $d_N$ to cluster analysis of artificial data is presented in Section 4. Finally, a generalized class of metrics is proposed on the basis of the results in this paper.

2. MATHEMATICAL FORMULATION

Definition 1. Consider a bounded set $S \subseteq R^n$ so that $\mathrm{Int}(S) \ne \emptyset$ where "Int" denotes interior.(10) A metric d in S is a mapping $d : S \times S \to [0, \infty)$ satisfying the following properties:

$$d(X, Y) \ge 0 \quad \forall\, X, Y \in S \tag{1a}$$

$$d(X, Y) = 0 \iff X = Y \tag{1b}$$

$$d(X, Y) = d(Y, X) \tag{1c}$$

and

$$d(X, Y) \le d(X, Z) + d(Z, Y) \quad \forall\, X, Y, Z \in S. \tag{1d}$$

The last relation is called the triangular inequality.

Definition 2. For two points $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, let $|x_i - y_i|$ be the maximum for $i = i_{xy}$. The proposed distance is defined as

$$d_N(X, Y) = |x_{i_{xy}} - y_{i_{xy}}| + \frac{1}{n - \left[\frac{n-2}{2}\right]} \sum_{\substack{i=1 \\ i \ne i_{xy}}}^{n} |x_i - y_i| \tag{2}$$

where $[a]$ means the integral part of "a", i.e. the largest integer $\le a$.

† Author to whom correspondence should be addressed.

Theorem 1. $d_N$ is a metric.

Proof. Let $X = (x_1, x_2, \ldots, x_n)$, $Y = (y_1, y_2, \ldots, y_n)$ and $Z = (z_1, z_2, \ldots, z_n)$ be three n-dimensional points in S. Let $a_i = x_i - y_i$, $b_i = y_i - z_i$ and $c_i = x_i - z_i$, so that $|x_i - y_i| = |a_i|$, $|y_i - z_i| = |b_i|$, $|x_i - z_i| = |c_i|$ and $a_i = c_i - b_i$. It can be shown easily that $d_N$ satisfies (1a)-(1c). Property (1d) is true for $n = 1$. Without loss of generality let $n \ge 2$. Now

$$d_N(X, Z) + d_N(Z, Y) \ge d_N(X, Y)$$

$$\iff |c_{i_{xz}}| + \frac{1}{n - \left[\frac{n-2}{2}\right]} \sum_{i \ne i_{xz}} |c_i| + |b_{i_{zy}}| + \frac{1}{n - \left[\frac{n-2}{2}\right]} \sum_{i \ne i_{zy}} |b_i| \ge |a_{i_{xy}}| + \frac{1}{n - \left[\frac{n-2}{2}\right]} \sum_{i \ne i_{xy}} |a_i|$$

$$\iff \left(n - \left[\tfrac{n-2}{2}\right] - 1\right)|c_{i_{xz}}| + \left(n - \left[\tfrac{n-2}{2}\right] - 1\right)|b_{i_{zy}}| + \sum_{i=1}^{n} |c_i| + \sum_{i=1}^{n} |b_i| \ge \left(n - \left[\tfrac{n-2}{2}\right] - 1\right)|a_{i_{xy}}| + \sum_{i=1}^{n} |a_i|. \tag{I1}$$

Now $|c_i - b_i| \le |c_i| + |b_i| \le |c_{i_{xz}}| + |b_{i_{zy}}|$ for every i, i.e.

$$\max_i |c_i - b_i| \le |c_{i_{xz}}| + |b_{i_{zy}}|,$$

i.e.

$$|a_{i_{xy}}| \le |c_{i_{xz}}| + |b_{i_{zy}}|.$$

Observe that $n - \left[\frac{n-2}{2}\right] - 1 > 0$. So

$$\left(n - \left[\tfrac{n-2}{2}\right] - 1\right)|a_{i_{xy}}| \le \left(n - \left[\tfrac{n-2}{2}\right] - 1\right)\left(|c_{i_{xz}}| + |b_{i_{zy}}|\right). \tag{I2}$$

Again $|c_i - b_i| \le |c_i| + |b_i|$, i.e.

$$\sum_{i=1}^{n} |c_i - b_i| \le \sum_{i=1}^{n} |c_i| + \sum_{i=1}^{n} |b_i|,$$

i.e.

$$\sum_{i=1}^{n} |a_i| \le \sum_{i=1}^{n} |c_i| + \sum_{i=1}^{n} |b_i|. \tag{I3}$$

Thus inequality (I1) holds with the help of relations (I2) and (I3). So (1d) is satisfied by $d_N$. Hence it is a metric. ∎

Note that the inequalities are numbered with "I" to distinguish them from equations.

3. PROPERTIES OF THE PROPOSED METRIC

As stated in Section 1, $d_N$ should satisfy the following desirable properties:
(i) $d_N$ is computationally more efficient than $d_E$;
(ii) $d_N$ should be closer to $d_E$ than $d_C$ and $d_M$.

Computational aspects are discussed at the end of this section. About the second property, it may be verified that $d_C$ over-estimates and $d_M$ under-estimates the Euclidean distance, i.e. $d_C \ge d_E \ge d_M$. To satisfy the desirable property (ii), the new distance also should satisfy $d_C \ge d_N \ge d_M$ so that it can be closer to $d_E$. It is readily seen that $d_N$ given by equation (2) satisfies this inequality. Our next step is to find how close $d_N$ is to $d_E$. To compare any metric d with respect to $d_E$, let us define

$$A_n(d) = \left\{ \frac{|d(X, Y) - d_E(X, Y)|}{d_E(X, Y)} : X, Y \in S \right\} \tag{3}$$

and $g_n(d) = \sup A_n(d)$. It is desirable that $g_n(d)$ should be as small as possible.

The following interesting results can be proved.

Theorem 2. If $d_N > d_E$ then

$$g_n(d_N) = \begin{cases} \sqrt{1 + \dfrac{4(n-1)}{(n+2)^2}} - 1 & \text{for } n = 2, 4, 6, \ldots \\[2ex] \sqrt{1 + \dfrac{4(n-1)}{(n+3)^2}} - 1 & \text{for } n = 3, 5, 7, \ldots \end{cases} \tag{4}$$

Theorem 3.

$$g_n(d_C) = \sqrt{n} - 1 \quad \text{for all } n. \tag{5}$$

Theorem 4.

$$g_n(d_M) = 1 - \frac{1}{\sqrt{n}} \quad \text{for all } n. \tag{6}$$

Theorems 2-4 provide the exact upper bounds of deviation of the new metric under the condition $d_N > d_E$ as well as of the City-block and Chessboard metrics from the Euclidean metric. They are compared in Fig. 1.

Fig. 1. $g_n(d_N)$, $g_n(d_C)$ and $g_n(d_M)$ against n (Dimension).
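The closed forms of Theorems 2-4 are easy to tabulate. The following Python sketch evaluates them over a range of dimensions, which is the comparison reported in Fig. 1. The function names are illustrative and do not come from the paper, and the value returned for the new metric is the Theorem 2 bound, which applies only under the condition $d_N > d_E$.

```python
import math


def g_new(n: int) -> float:
    """Theorem 2 bound for the new metric (valid when d_N > d_E)."""
    if n < 2:
        raise ValueError("the bound is stated for n >= 2")
    # Even n uses (n + 2)^2 in the denominator, odd n uses (n + 3)^2.
    denom = (n + 2) ** 2 if n % 2 == 0 else (n + 3) ** 2
    return math.sqrt(1.0 + 4.0 * (n - 1) / denom) - 1.0


def g_city_block(n: int) -> float:
    """Theorem 3: g_n(d_C) = sqrt(n) - 1."""
    return math.sqrt(n) - 1.0


def g_chessboard(n: int) -> float:
    """Theorem 4: g_n(d_M) = 1 - 1/sqrt(n)."""
    return 1.0 - 1.0 / math.sqrt(n)


if __name__ == "__main__":
    print(" n   g_n(d_N)  g_n(d_C)  g_n(d_M)")
    for n in range(2, 21, 2):
        print(f"{n:2d}  {g_new(n):8.4f}  {g_city_block(n):8.4f}  {g_chessboard(n):8.4f}")
```

Under these formulas the City-block bound grows like $\sqrt{n}$ and the Chessboard bound approaches 1, while the Theorem 2 bound stays small and shrinks as n increases.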

Lemma 1. If $n = 2$ then

$$d_N(X, Y) \ge d_E(X, Y) \quad \text{for every } X, Y \in S. \tag{I4}$$

Since by lemma 1 $d_N \ge d_E$ for $n = 2$, theorem 2 also provides the exact upper bound of deviation of the new metric from the Euclidean distance.

Theorem 5a. If $d_E > d_N$ then

$$g_n(d_N) \le 1 - \frac{1}{n - \left[\frac{n-2}{2}\right]} \quad \text{for } n \ge 3. \tag{I5}$$

Theorem 5b. If $d_E > d_N$ then

$$g_n(d_N) \ge 1 - \frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n}} \cdot \frac{n - 1}{n - \left[\frac{n-2}{2}\right]}. \tag{I6}$$

Theorem 5 states that for $d_E > d_N$, the upper bound of deviation of the new metric from the Euclidean metric lies between the right-hand sides of inequalities (I5) and (I6).

Lemma 2. For sufficiently large n

$$d_N(X, Y) > d_E(X, Y) \tag{I7}$$

with Lebesgue measure zero, i.e. we almost always have $d_N(X, Y) \le d_E(X, Y)$.

Lemma 2 states that in higher dimension, the new metric somewhat under-estimates the Euclidean metric.

Theorem 6. For sufficiently large n, i.e. when $n \to \infty$, the limiting value of $g_n(d_N)$ lies between 0 and 1, i.e.

$$0 \le \lim_{n \to \infty} g_n(d_N) \le 1. \tag{I8}$$

Theorem 7. For sufficiently large n, $d_N(X, Y)$ is closer to $d_E(X, Y)$ than both $d_C(X, Y)$ and $d_M(X, Y)$ for every $X, Y \in S$, i.e.

$$|d_N(X, Y) - d_E(X, Y)| \le |d_C(X, Y) - d_E(X, Y)| \tag{I9}$$

and

$$|d_N(X, Y) - d_E(X, Y)| \le |d_E(X, Y) - d_M(X, Y)|. \tag{I10}$$

All the theorems and lemmas are proved in the Appendix.

3.1. Computational aspect

We observe that for each pair of points in $R^n$ the following computations are needed for Euclidean distance: (i) n multiplications, (ii) (n - 1) additions and (iii) one square root. On the other hand, computation of $d_N(X, Y)$ needs: (i) (n - 1) comparisons, (ii) (n + 2) additions and (iii) two divisions. Since multiplication needs more computation than comparison, $d_N(X, Y)$ is more efficient than $d_E(X, Y)$.
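To make the operation counts concrete, here is a minimal Python sketch of the four distances discussed so far. It follows equation (2) for the new distance; the function names are illustrative and not from the paper, and the code favours readability over the operation counts quoted above.

```python
import math
from typing import List, Sequence


def _abs_diffs(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Coordinate-wise absolute differences |x_i - y_i|."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("points must be non-empty and of equal dimension")
    return [abs(a - b) for a, b in zip(x, y)]


def city_block(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(_abs_diffs(x, y))


def chessboard(x: Sequence[float], y: Sequence[float]) -> float:
    return max(_abs_diffs(x, y))


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum(d * d for d in _abs_diffs(x, y)))


def new_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (2): largest difference plus a scaled sum of the others."""
    diffs = _abs_diffs(x, y)
    n = len(diffs)
    largest = max(diffs)
    # n - [(n - 2) / 2]; Python's // is floor division, matching [a].
    divisor = n - (n - 2) // 2
    # Subtracting the maximum once excludes exactly one index i_xy,
    # even when several coordinates tie for the maximum.
    return largest + (sum(diffs) - largest) / divisor


if __name__ == "__main__":
    x, y = (0.0, 0.0, 0.0, 0.0), (3.0, 1.0, 1.0, 1.0)
    for name, d in [("d_C", city_block), ("d_E", euclidean),
                    ("d_N", new_distance), ("d_M", chessboard)]:
        print(name, round(d(x, y), 4))
```

The new distance needs no multiplication or square root: one pass finds the maximum and the sum, and a single division by $n - [(n-2)/2]$ finishes the computation.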

Table 1. The data set considered for clustering (one block per dimension; within each row the values are listed in data point order)

Dimension 1
Points 1-20: 0.000 0.934 0.958 0.599 0.179 0.412 0.103 0.516 0.717 0.734 0.379 0.938 0.171 0.876 0.629 0.182 0.761 0.353 0.347 0.130
Points 21-40: 1.372 1.035 1.211 1.725 1.828 1.380 1.776 1.508 1.168 1.345 1.040 1.459 1.703 1.208 1.661 1.373 1.578 1.578 1.559 1.051
Points 41-60: 0.603 0.809 0.665 0.519 0.100 0.102 0.555 0.556 0.516 0.517 0.723 0.765 0.552 0.163 0.254 0.443 0.552 0.810 0.626 0.491

Dimension 2
Points 1-20: 0.001 0.891 0.851 0.076 0.088 0.656 0.999 0.713 0.768 0.123 0.931 0.659 0.461 0.470 0.928 0.117 0.576 0.496 0.216 0.986
Points 21-40: 1.523 1.058 1.586 1.342 1.517 1.784 1.448 1.061 1.432 1.788 1.547 1.327 1.947 1.161 1.607 1.783 1.246 1.973 1.141 1.075
Points 41-60: 0.511 0.543 0.689 0.533 0.458 0.719 0.728 0.747 0.178 0.933 0.493 0.775 0.522 0.764 0.987 0.754 0.274 0.767 0.397 0.642

Dimension 3
Points 1-20: 0.005 0.944 0.484 0.069 0.915 0.224 0.068 0.629 0.152 0.129 0.177 0.511 0.223 0.940 0.905 0.068 0.608 0.801 0.175 0.746
Points 21-40: 1.786 1.038 1.615 1.529 1.563 1.280 1.702 1.791 1.077 1.622 1.923 1.833 1.349 1.088 1.696 1.337 1.272 1.637 1.813 1.989
Points 41-60: 0.637 0.974 0.148 0.532 0.846 0.397 0.374 0.480 0.424 0.949 0.452 0.762 0.165 0.115 0.637 0.538 0.682 0.315 0.743 0.431

Dimension 4
Points 1-20: 0.020 0.642 0.240 0.726 0.699 0.445 0.420 0.362 0.998 0.668 0.687 0.134 0.190 0.407 0.077 0.350 0.465 0.341 0.108 0.606
Points 21-40: 1.010 1.702 1.414 1.096 1.262 1.625 1.179 1.195 1.576 1.645 1.609 1.057 1.576 1.082 1.712 1.975 1.418 1.067 1.615 1.263
Points 41-60: 0.226 0.961 0.685 0.392 0.959 0.909 0.693 0.157 0.936 0.293 0.275 0.599 0.297 0.817 0.942 0.444 0.622 0.987 0.887 0.810

Dimension 5
Points 1-20: 0.074 0.356 0.090 0.736 0.957 0.652 0.902 0.507 0.622 0.848 0.525 0.200 0.136 0.980 0.313 0.494 0.320 0.840 0.071 0.921
Points 21-40: 1.988 1.871 1.953 1.815 1.696 1.229 1.756 1.052 1.766 1.269 1.352 1.844 1.315 1.699 1.013 1.821 1.058 1.666 1.372 1.670
Points 41-60: 0.625 0.998 0.778 0.560 0.136 0.886 0.794 0.626 0.803 0.221 0.580 0.736 0.292 0.866 0.919 0.821 0.599 0.090 0.637 0.984

Dimension 6
Points 1-20: 0.267 0.362 0.377 0.886 0.448 0.906 0.635 0.786 0.743 0.073 0.967 0.999 0.101 0.218 0.187 0.811 0.736 0.971 0.456 0.069
Points 21-40: 1.836 1.909 1.988 1.025 1.823 1.751 1.923 1.559 1.409 1.808 1.630 1.552 1.699 1.460 1.670 1.150 1.589 1.392 1.694 1.657
Points 41-60: 0.713 0.344 0.502 0.837 0.186 0.133 0.529 0.338 0.396 0.686 0.007 0.023 0.085 0.839 0.035 0.926 0.991 0.654 0.840 0.612

Dimension 7
Points 1-20: 0.934 0.961 0.449 0.689 0.081 0.566 0.694 0.151 0.865 0.805 0.083 0.192 0.386 0.493 0.306 0.419 0.534 0.270 0.098 0.127
Points 21-40: 1.122 1.619 1.350 1.818 1.670 1.444 1.736 1.883 1.562 1.430 1.606 1.718 1.363 1.469 1.900 1.504 1.011 1.360 1.816 1.912
Points 41-60: 0.657 1.079 1.006 1.978 1.899 1.824 1.028 1.400 1.144 1.126 0.820 0.511 0.877 0.244 0.938 0.169 0.557 0.114 0.305 0.816

Dimension 8
Points 1-20: 0.204 0.513 0.306 0.164 0.450 0.243 0.451 0.833 0.501 0.171 0.791 0.158 0.404 0.996 0.151 0.215 0.576 0.882 0.484 0.139
Points 21-40: 1.212 1.532 1.211 1.681 1.615 1.904 1.105 1.267 1.694 1.310 1.970 1.336 1.888 1.669 1.370 1.680 1.764 1.634 1.653 1.561
Points 41-60: 1.527 1.375 1.522 1.337 1.715 1.744 1.403 1.355 1.306 1.581 1.860 1.862 1.500 1.914 1.310 1.678 1.425 1.799 1.275 1.387

Dimension 9
Points 1-20: 0.812 0.429 0.794 0.784 0.975 0.363 0.456 0.635 0.222 0.782 0.002 0.224 0.947 0.533 0.153 0.522 0.651 0.857 0.020 0.689
Points 21-40: 1.171 1.617 1.116 1.722 1.662 1.430 1.009 1.656 1.102 1.989 1.363 1.553 1.062 1.793 1.119 1.541 1.486 1.558 1.570 1.156
Points 41-60: 1.246 1.543 1.076 1.221 1.202 1.050 1.165 1.528 1.538 1.353 1.781 1.567 1.106 1.286 1.419 1.549 1.538 1.763 1.901 1.978

Dimension 10
Points 1-20: 0.041 0.952 0.008 0.222 0.798 0.990 0.677 0.318 0.824 0.156 0.889 0.922 0.047 0.237 0.556 0.199 0.725 0.211 0.760 0.879
Points 21-40: 1.115 1.911 1.793 1.207 1.435 1.445 1.109 1.534 1.369 1.139 1.447 1.295 1.376 1.741 1.381 1.128 1.043 1.647 1.543 1.888
Points 41-60: 1.731 1.882 1.759 1.293 1.774 1.610 1.365 1.973 1.473 1.894 1.943 1.646 1.133 1.491 1.728 1.195 1.401 1.390 1.934 1.382

Dimension 11
Points 1-20: 0.935 0.855 0.906 0.282 0.011 0.675 0.958 0.188 0.843 0.895 0.316 0.513 0.761 0.624 0.957 0.495 0.491 0.550 0.383 0.076
Points 21-40: 1.155 1.919 1.719 1.743 1.654 1.805 1.573 1.297 1.293 1.936 1.415 1.794 1.698 1.308 1.221 1.899 1.879 1.856 1.130 1.922
Points 41-60: 0.170 0.410 0.872 0.774 0.825 0.203 0.707 0.090 0.993 0.186 0.629 0.775 0.840 0.368 0.593 0.231 0.559 0.476 0.495 0.490

Dimension 12
Points 1-20: 0.240 0.560 0.361 0.693 0.887 0.138 0.662 0.270 0.241 0.963 0.895 0.777 0.141 0.612 0.742 0.177 0.421 0.401 0.461 0.545
Points 21-40: 1.894 1.312 1.175 1.594 1.011 1.822 1.458 1.979 1.436 1.364 1.471 1.110 1.804 1.179 1.891 1.237 1.888 1.316 1.887 1.543
Points 41-60: 0.444 0.515 0.398 1.007 1.982 1.734 1.952 1.782 1.700 1.065 0.286 0.840 0.845 0.794 0.007 0.630 0.749 0.345 0.559 0.503

Dimension 13
Points 1-20: 0.031 0.666 0.012 0.616 0.222 0.756 0.346 0.930 0.956 0.725 0.522 0.045 0.001 0.059 0.837 0.610 0.108 0.451 0.316 0.591
Points 21-40: 1.966 1.602 1.576 1.881 1.175 1.691 1.592 1.194 1.985 1.763 1.093 1.514 1.543 1.298 1.360 1.336 1.420 1.189 1.153 1.959
Points 41-60: 0.137 0.406 0.539 0.072 0.470 0.574 0.351 0.885 0.264 0.718 0.048 0.062 0.514 0.445 0.704 0.701 0.462 0.782 0.901 0.603

Dimension 14
Points 1-20: 0.023 0.960 0.824 0.461 0.348 0.294 0.119 0.145 0.569 0.687 0.079 0.282 0.739 0.845 0.348 0.063 0.859 0.100 0.748 0.641
Points 21-40: 1.748 1.805 1.883 1.941 1.958 1.744 1.424 1.357 1.983 1.306 1.314 1.090 1.021 1.175 1.141 1.878 1.528 1.287 1.937 1.867
Points 41-60: 0.822 0.800 0.652 0.370 0.981 0.835 0.537 0.270 0.288 0.723 0.716 0.812 0.479 0.530 0.166 0.534 0.028 0.587 0.375 0.092

Dimension 15
Points 1-20: 0.857 0.766 0.835 0.221 0.091 0.958 0.598 0.504 0.809 0.593 0.774 0.285 0.420 0.539 0.553 0.888 0.180 0.540 0.644 0.523
Points 21-40: 1.794 1.409 1.112 1.717 1.166 1.246 1.221 1.397 1.031 1.966 1.047 1.919 1.232 1.373 1.606 1.247 1.388 1.024 1.248 1.567
Points 41-60: 0.701 0.145 0.064 0.571 0.655 0.845 0.065 0.658 0.352 0.872 0.869 0.310 0.246 0.171 0.656 0.893 0.010 0.485 0.145 0.124

4. APPLICATION

In this section an example is provided to show the applicability of the new distance measure in the context of clustering. This example also shows that the City-block and Chessboard distances are not better approximations to Euclidean distances than the new distance.

Data have been artificially generated in a 15-dimensional space. Three sets $A = [0, 1]^{15}$, $B = [1, 2]^{15}$ and $C = [0, 1]^7 \times [1, 2]^3 \times [0, 1]^5$ are chosen so that C overlaps with A and B partially. Twenty points are randomly selected from each set. The 60 points are numbered 1, 2, ..., 60 where the points numbered 1-20 are in A, 21-40 are in B and 41-60 are in C. The points are shown in Table 1.

The procedure to classify the above 60 points is based on the minimal spanning tree (MST).(4) There are various ways in which clusters are detected using MST. One of these methods is stated below.

(1) Draw the MST of the 60 points, where the edge weight is considered to be the distance between the corresponding points.
(2) Remove from the MST the two edges whose edge weights are maximum. The points in the resulting three trees give the three clusters.
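The two steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it builds the MST with Prim's algorithm (the paper does not say which MST algorithm was used), drops the k - 1 heaviest edges, and reads off the connected components with a small union-find. All names, and the six 2-dimensional demonstration points, are hypothetical; any distance function with the signature `dist(x, y)` can be plugged in.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Sequence[float]
Edge = Tuple[float, int, int]  # (weight, endpoint u, endpoint v)


def euclidean(x: Point, y: Point) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minimal_spanning_tree(points: Sequence[Point],
                          dist: Callable[[Point, Point], float]) -> List[Edge]:
    """Prim's algorithm on the complete graph over the points."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_clusters(points: Sequence[Point],
                 dist: Callable[[Point, Point], float],
                 k: int = 3) -> List[List[int]]:
    """Step (2): remove the k - 1 heaviest MST edges, return the components."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    edges = sorted(minimal_spanning_tree(points, dist))
    kept = edges[: len(edges) - (k - 1)]
    root = list(range(len(points)))

    def find(i: int) -> int:
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    groups: Dict[int, List[int]] = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    demo = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
    print(mst_clusters(demo, euclidean, k=3))
```

The indices returned are 0-based, whereas the paper numbers its points from 1.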

Fig. 3. The MST using the Euclidean distance.

Here, we have considered four distances $d_C$, $d_E$, $d_M$ and $d_N$, and an MST is drawn for each of them (Figs 2-5, respectively). Note that the MST using the new distance is closer to the MST using the Euclidean distance than the MST using the Chessboard distance is: the MST using the new distance differs from the MST using the Euclidean distance at 14 places, whereas the MST using the Chessboard distance differs from it at 29 places. Table 2 gives the classification results under the four distances. Observe that $d_E$, $d_M$ and $d_N$ classify all points correctly, but the City-block distance does not provide the required classification. Thus the new distance is indeed close to the Euclidean distance.
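The paper does not spell out how the number of "places" at which two MSTs differ is counted. One plausible reading, assumed here, is the number of edges of one tree that are absent from the other. A short Python sketch under that assumption (the function names and the toy trees are hypothetical):

```python
from typing import FrozenSet, Iterable, Set, Tuple


def edge_set(edges: Iterable[Tuple[int, int]]) -> Set[FrozenSet[int]]:
    """Undirected edges: (u, v) and (v, u) are the same edge."""
    return {frozenset(e) for e in edges}


def differing_edges(tree_a: Iterable[Tuple[int, int]],
                    tree_b: Iterable[Tuple[int, int]]) -> int:
    """Number of edges of tree_a that do not occur in tree_b."""
    return len(edge_set(tree_a) - edge_set(tree_b))


if __name__ == "__main__":
    mst_one = [(0, 1), (1, 2), (2, 3)]
    mst_two = [(1, 0), (1, 3), (2, 3)]
    print(differing_edges(mst_one, mst_two))
```

Because two spanning trees on the same point set have the same number of edges, this count is the same whichever tree is taken first.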

Fig. 2. The MST using the City-block distance.

5. DISCUSSION

In this paper, an approximation to Euclidean distance is suggested for higher dimensions and it is compared with the City-block and Chessboard distances. But the existence of an even better approximation to Euclidean distance is yet to be studied. A class of metrics can be defined on the basis of the results of this paper as

$$\rho(X, Y) = a_1 + \frac{1}{f_n} \sum_{i=2}^{n} a_i \tag{7}$$

where $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ are two n-dimensional points in S, $|x_i - y_i| = a_i$ for all i, $a_1 \ge a_i$ for $i = 2, 3, \ldots, n$, $f_n > 0$ for all n and $\{f_n\}$ is a sequence in n. It can be easily shown that for every n and for every positive $f_n$, the distance measure in equation (7) is indeed a metric. The new metric stated in this paper is a particular case of equation (7). Thus equation (7) gives a generalized system of metrics.

The utility of the above relation for various values of $\{f_n\}$ is to be studied in practical problems. Currently we are investigating this topic in problems related to clustering, distance transforms, etc.

Fig. 4. The MST using the Chessboard distance.

Fig. 5. The MST using the new distance.

6. SUMMARY

This paper deals with the problem of finding a new distance measure for large dimensional data such that the proposed distance is computationally more efficient than the Euclidean distance and is closer to it than the City-block and Chessboard distances, which are themselves computationally more efficient than the Euclidean distance.

Table 2. Three clusters formed by using various distances

Distance measure: The clusters

1. New distance
A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
B = {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40}
C = {41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60}

2. Euclidean distance
A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
B = {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40}
C = {41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60}

3. Chessboard distance
A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
B = {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40}
C = {41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60}

4. City-block distance
A = {1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60}
B = {5, 20}
C = {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40}

Let the City-block, Chessboard, Euclidean and the new distances between the points $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ in a bounded set $S \subseteq R^n$ be denoted by $d_C(X, Y)$, $d_M(X, Y)$, $d_E(X, Y)$ and $d_N(X, Y)$, respectively. Then

$$d_C(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$

$$d_M(X, Y) = \max_i |x_i - y_i|$$

$$d_E(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_N(X, Y) = |x_{i_{xy}} - y_{i_{xy}}| + \frac{1}{n - \left[\frac{n-2}{2}\right]} \sum_{i \ne i_{xy}} |x_i - y_i|$$

where $|x_i - y_i|$ is maximum for $i = i_{xy}$ and $[a]$ indicates the integral part of "a", i.e. the largest integer $\le a$. It is very easy to see that $d_N$ is computationally more efficient than $d_E$. It is shown that

(i) $d_N(X, Y)$ is a metric.

(ii) For sufficiently large n

$$|d_N(X, Y) - d_E(X, Y)| \le |d_E(X, Y) - d_M(X, Y)|$$

and

$$|d_N(X, Y) - d_E(X, Y)| \le |d_C(X, Y) - d_E(X, Y)|.$$

(iii)

$$g_n(d_N) = \operatorname{Sup}\left\{ \frac{|d_N(X, Y) - d_E(X, Y)|}{d_E(X, Y)} : X, Y \in S \right\}$$

will be

$$g_n(d_N) = \begin{cases} \sqrt{1 + \dfrac{4(n-1)}{(n+2)^2}} - 1 & \text{for } n = 2, 4, 6, \ldots \\[2ex] \sqrt{1 + \dfrac{4(n-1)}{(n+3)^2}} - 1 & \text{for } n = 3, 5, 7, \ldots \end{cases}$$

if $d_N > d_E$. Otherwise

$$1 - \frac{1}{\sqrt{n}} - \frac{1}{\sqrt{n}} \cdot \frac{n - 1}{n - \left[\frac{n-2}{2}\right]} \le g_n(d_N) \le 1 - \frac{1}{n - \left[\frac{n-2}{2}\right]}$$

and

$$0 \le \lim_{n \to \infty} g_n(d_N) \le 1.$$

(iv)

$$\operatorname{Min}\left\{ \operatorname{Sup}\left\{ \frac{|d_C - d_E|}{d_E} \right\}, \operatorname{Sup}\left\{ \frac{|d_M - d_E|}{d_E} \right\} \right\} > \operatorname{Sup}\left\{ \frac{|d_N - d_E|}{d_E} \right\}.$$

An example is provided to show the utility of the new distance in the context of clustering on a 15-dimensional data set. Finally a generalized system of metrics is defined as

$$\rho(X, Y) = a_1 + \frac{1}{f_n} \sum_{i=2}^{n} a_i$$

where $|x_i - y_i| = a_i$ for all i, $a_1 \ge a_i$ for all i, $f_n > 0$ for all n and $\{f_n\}$ is a sequence in n. It can be easily shown that for any positive sequence $\{f_n\}$ the generalized distance measure is indeed a metric. The new metric stated in this paper is a particular case of the generalized system of metrics.

Acknowledgement--The authors wish to thank Prof. D. Dutta Majumder and Mr N. Chatterjee for their interest in this work. The authors also acknowledge Mr J. Gupta for typing the manuscript and Mr S. Chakraborty for his drawings.

REFERENCES

1. G. Borgefors, Distance transformations in arbitrary dimensions, Comput. Vision Graphics Image Process. 27, 321-345 (1984).
2. B. Kleiner and J. A. Hartigan, Representing points in many dimensions by trees and castles, J. Am. Statist. Ass. 76, 260-276 (1981).
3. M. Yamashita and T. Ibaraki, Distances defined by neighbourhood sequences, Pattern Recognition 19, 337-346 (1989).
4. C. T. Zahn, Graph theoretic methods for detecting and describing gestalt clusters, IEEE Trans. Comput. C-20, 68-86 (1971).
5. M. R. Anderberg, Clustering Analysis for Application. Academic Press, New York (1971).
6. R. M. Cormack, A review of classification, J. R. Statist. Soc. Series A 134(3), 321-367 (1971).
7. A. Rosenfeld and J. L. Pfaltz, Distance functions on digital pictures, Pattern Recognition 1, 33-61 (1968).
8. R. A. Melter and I. Tomescu, Path generated digital metrics, Pattern Recognition Lett. 1, 151-154 (1983).
9. P. E. Danielsson, Euclidean distance mapping, Comput. Graphics Image Process. 14, 227-248 (1980).
10. T. M. Apostol, Mathematical Analysis, pp. 151-152. Addison-Wesley, Reading, Massachusetts (1957).
11. P. Billingsley, Probability and Measure. Wiley, New York (1979).
12. J. A. Richards, Remote Sensing Digital Image Analysis--an Introduction. Springer, Berlin (1986).
13. P. A. Devijver and J. Kittler, Pattern Recognition--a Statistical Approach. Prentice-Hall, London (1982).

APPENDIX

Proof of theorem 2. In order to find $g_n(d_N)$ we shall use the following theorem.(10)

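Theorem 2.1. Let f have continuous second-order partial derivatives on an open set S in $E_n$. Let $X_0$ be a point of S for which $D_1 f(X_0) = \cdots = D_n f(X_0) = 0$. Assume that the determinant $\Delta = \det\{D_{i,j} f(X_0)\} \ne 0$. Let $\Delta_0 = 1$ and $\Delta_{n-k}$ be the determinant obtained from $\Delta$ by deleting the last k rows and columns. If the n + 1 numbers $\Delta_0, \Delta_1, \ldots, \Delta_n$ are all positive then f has a local minimum at $X_0$. If these numbers are alternately positive and negative, then f has a local maximum at $X_0$.

Now, let $a_1 \ge a_i$, $i = 2, 3, \ldots, n$, where $a_i = |x_i - y_i|$.

Case (i). Let n be even, i.e. let n = 2k. Then

$$g_n(d_N) = \operatorname{Sup}\left\{ \frac{d_N - d_E}{d_E} \right\} = \operatorname{Sup}\left\{ \frac{a_1 + \frac{1}{k+1} \sum_{i=2}^{2k} a_i}{\sqrt{\sum_{i=1}^{2k} a_i^2}} - 1 \right\} = \operatorname{Sup}\left\{ \frac{1 + \frac{1}{k+1} \sum_{i=2}^{2k} z_i}{\sqrt{1 + \sum_{i=2}^{2k} z_i^2}} - 1 \right\} \tag{A1}$$

where $z_i = a_i / a_1$, $i = 2, 3, \ldots, 2k$. Let

$$f(Z) = \frac{1 + \frac{1}{k+1} \sum_{i=2}^{2k} z_i}{\sqrt{1 + \sum_{i=2}^{2k} z_i^2}} \tag{A2}$$

where $Z = (z_2, z_3, \ldots, z_{2k})$. Now differentiating (A2) partially with respect to $z_j$, $j = 2, 3, \ldots, 2k$,

$$D_j = \frac{\partial f}{\partial z_j} = \frac{1}{k+1} \cdot \frac{1}{\left(1 + \sum_{i=2}^{2k} z_i^2\right)^{1/2}} - \frac{z_j \left(1 + \frac{1}{k+1} \sum_{i=2}^{2k} z_i\right)}{\left(1 + \sum_{i=2}^{2k} z_i^2\right)^{3/2}}. \tag{A3}$$

Equating $\partial f / \partial z_j$ to zero we get the only real solution of $z_i$ given by

$$\hat{z}_i = \frac{1}{k+1} \quad \text{for } i = 2, 3, \ldots, 2k.$$

Let $\hat{Z} = (\hat{z}_2, \hat{z}_3, \ldots, \hat{z}_{2k})$. Let

$$D_{i,j} = \frac{\partial^2 f}{\partial z_i \, \partial z_j}.$$

It can be easily found that

$$D_{j,j} \text{ at } \hat{Z} = -\frac{(k^2 + 4k - 1)(k + 1)}{(k^2 + 4k)^{3/2}} = -\frac{(n^2 + 8n - 4)(n + 2)}{(n^2 + 8n)^{3/2}} \tag{A4}$$

and

$$D_{i,j} \text{ at } \hat{Z} = \frac{k + 1}{(k^2 + 4k)^{3/2}} = \frac{4(n + 2)}{(n^2 + 8n)^{3/2}}, \quad i \ne j. \tag{A5}$$

Now the determinant

$$\Delta_{n-1} = \begin{vmatrix} D_{2,2} & D_{2,3} & \cdots & D_{2,n} \\ D_{3,2} & D_{3,3} & \cdots & D_{3,n} \\ \vdots & \vdots & & \vdots \\ D_{n,2} & D_{n,3} & \cdots & D_{n,n} \end{vmatrix}_{(n-1) \times (n-1)} = \left[ \frac{n + 2}{(n^2 + 8n)^{3/2}} \right]^{n-1} \begin{vmatrix} -(n^2 + 8n - 4) & 4 & \cdots & 4 \\ 4 & -(n^2 + 8n - 4) & \cdots & 4 \\ \vdots & \vdots & & \vdots \\ 4 & 4 & \cdots & -(n^2 + 8n - 4) \end{vmatrix}$$

$$= (-1)^{n-2} \left[ \frac{n + 2}{(n^2 + 8n)^{3/2}} \right]^{n-1} \{4(n - 2) - (n^2 + 8n - 4)\} (n^2 + 8n)^{n-2} \ne 0.$$

It can be seen that

$$\Delta_{n-1} = (-1)^{n-2} \left[ \frac{n + 2}{(n^2 + 8n)^{3/2}} \right]^{n-1} \{4(n - 2) - (n^2 + 8n - 4)\} (n^2 + 8n)^{n-2} < 0$$

$$\Delta_{n-2} = (-1)^{n-3} \left[ \frac{n + 2}{(n^2 + 8n)^{3/2}} \right]^{n-2} (n^2 + 8n)^{n-3} \{4(n - 3) - (n^2 + 8n - 4)\} > 0$$

$$\vdots$$

$$\Delta_2 = \left[ \frac{n + 2}{(n^2 + 8n)^{3/2}} \right]^{2} \{(n^2 + 8n - 4)^2 - 16\} > 0.$$

Thus $\Delta_{n-1}, \Delta_{n-2}, \ldots, \Delta_2$ are alternately positive and negative. Hence function (A2) has a local maximum at $\hat{Z}$. The maximum value

$$= \sqrt{1 + \frac{2k - 1}{(k + 1)^2}} = \sqrt{1 + \frac{4(n - 1)}{(n + 2)^2}},$$

so that $g_n(d_N) = \sqrt{1 + \dfrac{4(n-1)}{(n+2)^2}} - 1$ for even n.

Case (ii). Let n be odd, i.e. n = 2k + 1. The proof is similar to that of case (i): the function

$$f(Z) = \frac{1 + \frac{1}{k+2} \sum_{i=2}^{2k+1} z_i}{\sqrt{1 + \sum_{i=2}^{2k+1} z_i^2}} \tag{A7}$$

has a maximum at $\hat{Z} = (\hat{z}_2, \hat{z}_3, \ldots, \hat{z}_{2k+1})$ where

$$\hat{z}_i = \frac{1}{k + 2} \quad \forall\, i = 2, 3, \ldots, 2k + 1.$$

The maximum value of $f(Z) - 1$

$$= \sqrt{1 + \frac{2k}{(k + 2)^2}} - 1 \tag{A8}$$

$$= \sqrt{1 + \frac{4(n - 1)}{(n + 3)^2}} - 1. \qquad ∎$$

Proof of theorem 3.

$$g_n(d_C) = \operatorname{Sup}\left\{ \frac{d_C - d_E}{d_E} \right\} \quad (\text{since } d_C \ge d_E)$$

$$= \operatorname{Sup}\left\{ \frac{\sum_{i=1}^{n} a_i}{\sqrt{\sum_{i=1}^{n} a_i^2}} - 1 \right\}. \tag{A9}$$

Now

$$\sum_{i=1}^{n} a_i^2 \ge \frac{\left(\sum_{i=1}^{n} a_i\right)^2}{n} \quad (\text{from the Cauchy-Schwarz inequality}).$$

Therefore

$$\frac{\sum_{i=1}^{n} a_i}{\sqrt{\sum_{i=1}^{n} a_i^2}} \le \sqrt{n}.$$

Therefore

$$g_n(d_C) \le \sqrt{n} - 1. \tag{A-I1}$$

Again let us assume that the points are bounded in n-dimensional space. Consider two points X and Y such that $|x_i - y_i| = a_i = m$ $\forall\, i = 1, 2, \ldots, n$. Now $d_E = m\sqrt{n}$ and $d_C = nm$. Therefore

$$\frac{d_C - d_E}{d_E} = \sqrt{n} - 1 \in A_n(d_C).$$

$g_n(d_C)$, which is the supremum of $A_n(d_C)$, cannot be smaller than $\sqrt{n} - 1$, i.e.

$$g_n(d_C) \ge \sqrt{n} - 1. \tag{A-I2}$$

If both inequalities (A-I1) and (A-I2) are true, then the only possibility is

$$g_n(d_C) = \sqrt{n} - 1. \qquad ∎$$

Proof of theorem 4. Let $a_1 \ge a_i$ for $i = 2, 3, \ldots, n$. Therefore

$$g_n(d_M) = \operatorname{Sup}\left\{ \frac{d_E - d_M}{d_E} \right\} = \operatorname{Sup}\left\{ 1 - \frac{a_1}{\sqrt{\sum_{i=1}^{n} a_i^2}} \right\}.$$

Now

$$\sqrt{\sum_{i=1}^{n} a_i^2} \le a_1 \sqrt{n} \quad (\text{since } a_1 \ge a_i \ \forall\, i).$$

Therefore

$$\frac{a_1}{\sqrt{\sum_{i=1}^{n} a_i^2}} \ge \frac{1}{\sqrt{n}}.$$

Hence

$$g_n(d_M) \le 1 - \frac{1}{\sqrt{n}}. \tag{A-I3}$$

Again let us assume that the points are bounded in n-dimensional space. As in the case of theorem 3, consider the points X and Y such that $|x_i - y_i| = a_i = m$ $\forall\, i = 1, 2, \ldots, n$. Now $d_E = m\sqrt{n}$ and $d_M = m$. Therefore

$$\frac{|d_E - d_M|}{d_E} = 1 - \frac{1}{\sqrt{n}}.$$

So

$$g_n(d_M) = \operatorname{Sup}\left\{ \frac{|d_E - d_M|}{d_E} \right\} \ge 1 - \frac{1}{\sqrt{n}}. \tag{A-I4}$$

Hence, from inequalities (A-I3) and (A-I4) we conclude that

$$g_n(d_M) = 1 - \frac{1}{\sqrt{n}}. \qquad ∎$$

Proof of lemma 1. Let $a_1 \ge a_2 \ge 0$. So $d_E(X, Y) = \sqrt{a_1^2 + a_2^2}$ and $d_N(X, Y) = a_1 + a_2/2$. Now $(a_1 + a_2/2)^2 = a_1^2 + a_2^2/4 + a_1 a_2$. Therefore

$$\left(a_1 + \frac{a_2}{2}\right)^2 - (a_1^2 + a_2^2) = \frac{a_2}{4}(4a_1 - 3a_2).$$

Since $a_1 \ge a_2$, we have $4a_1 - 3a_2 \ge 0$. Therefore $(a_1 + a_2/2) \ge \sqrt{a_1^2 + a_2^2}$. Hence $d_N(X, Y) \ge d_E(X, Y)$. ∎

Proof of theorem 5a. Let $a_1 \ge a_i$ for $i = 2, 3, \ldots, n$. Therefore

$$g_n(d_N) = \operatorname{Sup}\left\{ 1 - \frac{a_1 + \frac{1}{n - \left[\frac{n-2}{2}\right]} \sum_{i=2}^{n} a_i}{\sqrt{\sum_{i=1}^{n} a_i^2}} \right\}. \tag{A11}$$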


Now we know that ¢:} al

n

a~

1

.

~

ai >

n--

¢~(~ai)2+2(n_[_7])n-2 xal~a~(n-

a~+

X/(~aT)

-> :

~\i=l

:

2k

1

"

. >

1-

1

"

.

,

X/(~a2i)

~\i=l

~\i=l

/=1

n-[~-~]X

/

d N < dE ¢=>(4k + 1)(2k - 1)M 2 < (k + 2)2(2k - 1)m 2

/

Ms

"

~ai

:>

,/fi

a

n

1 n-2

i=1 ~(~

1 [~_2] '

-<1

(A-I7)

¼k+l 4 + 4k+'-'-'-i"

Proof of theorem 6. Let $a_1 \ge a_i$ for $i = 2, 3, \ldots, n$. Now by lemma 2, $d_N < d_E$ for sufficiently large n. Therefore

From (A-I6) and (A-I7) we have

al + 1

g,(dN) ~ 1

k

4k+ 1

Observe that if $n/8 > M^2/m^2$, $m \ne 0$, then $d_N < d_E$. Observe that $m \ne 0$ for uncountably many points in $S \times S$. In fact, for a given point $X \in S$, m will be zero with Lebesgue measure zero since the n-dimensional Lebesgue measure of an (n - 1)-dimensional set is zero.(11) Hence $d_N > d_E$ with measure zero. ∎

/

Therefore

1-

k2+2k+l

<:=~- T <

,.2.)

~\i=l

(A-I9)

i=2

i=2

(A-I6)

Again

i=1

2k

2 ~ a 2.

Let $m < a_i$ for all i. ($m > 0$ is true for many values of X and Y.) So the right-hand side of (A-I9) will be greater than $(k + 1)^2 (2k - 1) m^2$. Now for sufficiently large n (i.e. for large k)

(f~a~)' \i=1

2k

+2(k+l)a1~ai<>(k+l)

(A-I5) n

1

2

$$(2k - 1)^2 M^2 + 2(k + 1) M (2k - 1) M = (4k + 1)(2k - 1) M^2.$$

/

From ( A l l ) and (A-I5) we have

g,(ds) < 1

2 ,

Let $a_i < M$ for all i. Therefore, the left-hand side of (A-I9) will be less than

.

,/(~a~i)

2

(~ail \i=2

~I-

n-2

In reality, values of $a_i$ are usually bounded. In that case, for sufficiently large n, $d_N < d_E$ because inequality (A-I8) involves n. Putting n = 2k in (A-I8) we have

X/(~a2)

Therefore

a,+

a~

i= I n--

~Xi=l

~-

n-

_

al +

ff

.



lira

= 1 - lira ~xi=

Proof of theorem 5b. It has been assumed in Section 2 that the points are bounded in n-dimensional space and the interior of the set is non-empty. Hence, there exist two points X and Y such that $|x_i - y_i| = a_i = m$ $\forall\, i = 1, 2, \ldots, n$. Therefore

^ ~ ai

_-"

1

/

Put n = 2k. If $n \to \infty$ then $k \to \infty$. So

(k + 1)al + ~ ai

=

laE - a s l

lira

,=s

= 1 - lim (k + 1)

a

('~'/oo form, by applying L'Hospital's rule) m +

(n - 1)m al n--

>

1-

= 1 -

l

i

m

-

m~/n ~l X i = l

1

1

( n - 1)

=i-:~n-:~nn'{n_[~_2]

}"



Proof of lemma 2. Let $a_1 \ge a_i$ for $i = 2, 3, \ldots, n$. Now $d_N \gtrless d_E$

From the above limit we see that the limit exists, but it is not possible to find the exact value of this limit. We see that if $a_2 = a_3 = \cdots = a_{2k} = 0$ then the limiting value will be 0. Otherwise if $a_2, a_3, \ldots, a_{2k}$ are non-zero then the limiting value will be 1. Also if $a_1 = a_2 = \cdots = a_{2k}$ then the limiting value will be 1. Now

(by putting n = 2k)

'uPf l=Su f!im )

2k

<:>4(k + 1) 2 ~ a~ X 4(k + 1)2a { i-I

Therefore

2k

$$0 \le \lim_{n \to \infty} g_n(d_N) \le 1. \qquad ∎$$

2k

2

+ 4(k + l)(k + 2)al ~ ai + (k + 2) 2 (~,, ai)



i~2 n

Proof of theorem 7. Let $a_1 \ge a_i$ for $i = 2, 3, \ldots, n$. Now by lemma 2

/

¢~4(n + 2) 2 ~ a ] @ 4(n + 2)(n + 4)al ~ ai i=2

dM <--dN --
\1=2 n

+ (n + 4) 2

i=2

a,

.

(A-I10)

For dN <- dE --
IdN - dEI X Idc - dE[ ~:~dE -- dN X dc - dE ¢~ 2dE X d c + dN ¢:~2(k+1)

fdN - d E l < l d c a~ X 2 ( k + l ) a l + ( k + 2 ) ~ a ,

"~/ \ i = l

/

dEI.

It is easy to see that

$$|d_N - d_E| < |d_E - d_M|. \qquad ∎$$

About the Author--D. CHAUDHURI was born in Bolpur (Santiniketan), India. He received a Bachelor of Mathematics (Hons) from Visva-Bharati University, Santiniketan, in 1984 and an M.Sc. (applied mathematics) from Jadavpur University, Calcutta, in 1987. Currently he is a regular research worker in the Physical and Earth Science Division, Indian Statistical Institute, Calcutta. His fields of interest are pattern recognition, image processing and computer graphics.

About the Author--C. A. MURTHY was born in Ongole, India, on 12 June 1958. He received Bachelor of Statistics (Hons) and M.Stat. degrees from the Indian Statistical Institute (ISI), Calcutta, in 1979 and 1980, respectively. He worked as a research fellow in ISI up to 1987. He then joined as a programmer in a project in which analysis of satellite imagery was the focal point. He received his Ph.D. degree from ISI, Calcutta, in 1989. The field of his doctoral dissertation was pattern recognition. Currently he is a programmer in the project on knowledge-based computing systems, which is jointly sponsored by UNDP and the Department of Electronics, India. His fields of interest are pattern recognition, image processing, computer vision and fuzzy sets. He is a member of the Indian Unit for Pattern Recognition and Artificial Intelligence and the Indian Society for Fuzzy Mathematics and Information Processing.

About the Author--B. B. CHAUDHURI received the B.Sc. (Hons), B.Tech. and M.Tech. degrees from Calcutta University, India, in 1969, 1972 and 1974, respectively, and the Ph.D. degree from the Indian Institute of Technology, Kanpur, in 1980. He joined the Indian Statistical Institute, Calcutta, in 1978 where he is currently a professor and Professor-in-charge of the Physical and Earth Science Division. His initial research work was on dielectric and optical wave guides. Later on, he became more interested in pattern recognition, image processing, computer graphics and natural language processing. He has published 80 research papers in international journals and has written a book entitled Two Tone Image Processing and Recognition. He was awarded the Sir J. C. Bose Memorial Award for best engineering science oriented paper published in JIETE in 1986 and the M. N. Saha Memorial Award for best application oriented paper published in 1989. He acts as a referee to many international journals. He was the winner of the Leverhulme Overseas Visiting Fellowship in 1981-82 to work at Queens University. He worked as a visiting faculty member at GSF, Munich, and a Guest Professor at the University of Hannover during 1986-88. He again visited several German, Italian and Swiss institutions during 1990-91. In 1986 he started a successful on-going Indo-German scientific collaboration in biomedical image processing and related topics. He is a senior member of IEEE and fellow/member of many academic professional bodies.