Signal Processing 119 (2016) 43–55
Signal Processing journal homepage: www.elsevier.com/locate/sigpro
A weighted-ROC graph based metric for image segmentation evaluation

Yuncong Feng, Xuanjing Shen, Haipeng Chen*, Xiaoli Zhang

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
Article info

Article history: Received 26 January 2015; received in revised form 10 July 2015; accepted 13 July 2015; available online 21 July 2015

Keywords: Image segmentation evaluation; ROC graph; Spatial information; Distance transform

Abstract

Evaluation of image segmentation algorithms is a crucial task in the image processing field. Traditional objective evaluation measures, such as ME and JS, treat the object pixels and the background pixels in images identically, which is not reasonable in practical applications. To overcome this problem, a new objective evaluation metric based on the weighted-ROC graph is proposed in this paper. Considering that pixels in different positions may have different importance, each pixel is given a weight based on its spatial information. The ROC (receiver operating characteristic) graph with a weighting strategy is constructed to evaluate the performance of segmentation algorithms quantitatively. The proposed metric focuses on the segmented objects, which is similar to the human visual system. Meanwhile, it preserves the robustness of ROC against region imbalance. The experimental results on various images show that the proposed metric gives more reasonable evaluation results than other metrics. © 2015 Elsevier B.V. All rights reserved.
* Corresponding author.
E-mail addresses: [email protected] (Y. Feng), [email protected] (X. Shen), [email protected] (H. Chen), [email protected] (X. Zhang).
http://dx.doi.org/10.1016/j.sigpro.2015.07.010
0165-1684/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Image segmentation, a fundamental yet still challenging task in image processing, partitions an image into non-overlapping regions with coherent features and simultaneously separates the objects of interest from the background regions [1,2]. It has become an essential preprocessing step for many applications such as computer vision, object recognition and medical image processing [3]. In the past decades, image segmentation has been studied extensively [4–7]. However, how to evaluate the performance of segmentation algorithms is still an open problem. Since there are many possible interpretations of an image, multiple valid solutions may be found for segmenting a given image. Hence, image segmentation is an inherently ill-posed problem [8,9], which makes its evaluation difficult.

Generally speaking, the existing evaluation methods can be classified into two categories: subjective ones and objective ones. The former leave a technique designer to judge the performance of image segmentation algorithms based only on intuition [8]; they are time consuming and prone to inconsistent evaluation results because human visual judgments differ. The latter can be further divided into analytical methods, empirical goodness methods and empirical discrepancy methods [10,11]. Analytical methods do not need to implement the segmentation algorithms; instead, they analyze the algorithms directly from aspects such as principle, complexity and efficiency. However, these properties are usually independent of segmentation results, so it is difficult to distinguish effectively between different algorithms. Both the empirical goodness and the empirical discrepancy methods are experimental, since they use segmentation results to evaluate the performance of segmentation algorithms indirectly. The
Y. Feng et al. / Signal Processing 119 (2016) 43–55
empirical goodness methods, also known as standalone evaluation methods [12], evaluate the segmentation results based on goodness parameters that are relevant to the visual properties extracted from the original image and the segmented image [13], without any prior knowledge. However, this type of strategy may be unfair in some cases, especially when the same properties are used both to design and to evaluate segmentation algorithms. The empirical discrepancy methods evaluate segmentation algorithms based on the discrepancy between the segmented image and its reference image [14,15]. They are now believed to be superior to the goodness methods and are commonly used for objective evaluation.

A variety of empirical discrepancy metrics have been put forward. Generally, these metrics are based on differences in the number or the position of misclassified pixels, the number of misclassified segmented objects, and the like. ME (misclassification error) is an early pixel-based evaluation metric which denotes the percentage of incorrectly classified pixels [16–19]. However, it fails to evaluate segmentations in the case of region imbalance. JS (Jaccard similarity) is another widely used measure [20–22], defined as the ratio of the intersection to the union of the object regions in the reference image and the segmentation result. Nevertheless, it neglects the importance of the spatial information of image pixels in the evaluation process.

To overcome the shortcomings mentioned above, this paper provides a novel objective evaluation metric based on the weighted-ROC graph. In the metric, each pixel is weighted based on its spatial information to represent its importance. The contributions of the work can be summarized as follows.
(1) The importance of each pixel is taken into account in the evaluation process. Different pixels gain different importance based on their spatial information.
(2) A weighting strategy is applied to the ROC graph, and each segmented image obtains a quantitative evaluation score within the interval [0, 1].
(3) The metric preserves the robustness of ROC against region imbalance, which brings the assessment results more in line with subjective evaluation by the human visual system.
(4) The proposed method holds for both uniform illumination and non-uniform illumination images.

The rest of the paper is organized as follows: motivation is given in Section 2; Section 3 details the implementation of the proposed weighted-ROC graph based metric; comparison experiments and discussions are given in Section 4; finally, conclusions are drawn in Section 5.
2. Motivation

Nearly all the existing metrics treat each pixel equally in the process of evaluating segmentations. However, pixels may have different importance. Specifically, for some pixels, whether or not they are correctly segmented has little effect on human judgment or further automatic processing. For other pixels, however, an incorrect segmentation may heavily influence further processing. Generally, this difference arises from the spatial information of the pixels.
Fig. 1. Example of two different segmentation results: (a) segmentation result with misclassification pixels near the object edge; (b) segmentation result with misclassification pixels off the object edge; (c) contour of (a); (d) contour of (b); (e) ground truth. Legend: object, over-segmentation, under-segmentation.
Fig. 2. Pixel classification: (a) ground truth, white part is the object region denoted as P (Positive), black part is the background region represented by N (Negative); (b) automatic segmentation result, TP (True Positive) and TN (True Negative) denote the correctly classified object and background pixels respectively, FP (False Positive) and FN (False Negative) denote the incorrectly classified background and object pixels respectively, which mean over-segmentation pixels and under-segmentation pixels.
Consider the example shown in Fig. 1, where two different segmentation results of the same image are shown in Fig. 1(a) and (b), respectively. The number of misclassified pixels in Fig. 1(a), counting both over-segmentation and under-segmentation pixels, is identical to that in Fig. 1(b). The difference between Fig. 1(a) and (b) is that the misclassified pixels in the former are much closer to the object edge. Traditional metrics, including ME, JS, and ROC [23], give the two segmentation results the same score. However, from the viewpoint of subjective judgment, Fig. 1(a) should be penalized more severely, because the discrepancy between the segmentation edges in Fig. 1(c) and the ground truth in Fig. 1(e) is very large, which seriously affects the description of the object. The segmentation contour in Fig. 1(d) is somewhat better than that in Fig. 1(c): it segments the edge of the original object correctly, which is of great importance to the segmentation result, although it increases the number of objects. So Fig. 1(b) should be given a higher score.

Summing up, segmentation results are heavily affected by the pixels close to object edges. Therefore, each pixel of the segmented image should be given a weight that is related to the pixel's distance from the nearest edge in the ground truth. The smaller the distance is, the more important the corresponding pixel is.

3. The proposed metric

3.1. The framework

Let $I$ denote an automatically segmented image and $R$ the ground truth. Both $I$ and $R$ have the same size $M \times N$. Before the presentation of the framework, a definition is given first.
Definition 1. Let $x_{ij}$ ($1 \le i \le M$, $1 \le j \le N$) denote a pixel in the automatic segmentation result $I$, and let $x'_{ij}$ ($1 \le i \le M$, $1 \le j \le N$) be the corresponding pixel in the ground truth $R$.

a) For an object pixel $x_{ij}$ in image $I$, if $x'_{ij}$ is an object pixel in image $R$, $x_{ij}$ is defined as a True Positive (TP) pixel; otherwise, $x_{ij}$ is defined as a False Positive (FP) pixel.
b) For a background pixel $x_{ij}$ in image $I$, if $x'_{ij}$ is a background pixel in image $R$, $x_{ij}$ is defined as a True Negative (TN) pixel; otherwise, $x_{ij}$ is defined as a False Negative (FN) pixel.

Fig. 2 shows an example of the pixel classification. The steps of the proposed objective metric are as follows.

Step 1. By comparing with the reference map $R$, classify the pixels in image $I$ into four classes: TP, FP, TN, FN.

Step 2. Calculate the weighted TP rate (wTPR) and the weighted FP rate (wFPR) as:

$$ wTPR = \frac{wTP}{wTP + wFN} \tag{1} $$

$$ wFPR = \frac{wFP}{wFP + wTN} \tag{2} $$

where wTP, wFP, wTN, and wFN represent the weighted sums of the TP pixels, FP pixels, TN pixels and FN pixels, respectively. These four quantities are discussed in Subsection 3.2.

Step 3. Plot the point (wFPR, wTPR) in the weighted-ROC space. Calculate the Euclidean distance between the plotted point $a(x, y)$ and the perfect point $p(0, 1)$:

$$ d(a, p) = \sqrt{(x - 0)^2 + (y - 1)^2} = \sqrt{x^2 + (y - 1)^2} \tag{3} $$

Step 4. Obtain the predicted score of the segmentation instance $I$ from $d$:

$$ S(a) = \frac{\sqrt{2} - d(a, p)}{\sqrt{2}} = 1 - \frac{d(a, p)}{\sqrt{2}} \tag{4} $$

The final score $S$ lies within the interval [0, 1]. The higher $S$ is, the better the segmentation result is.

3.2. The strategy of weighting

Considering that the importance of pixels depends on their distances from boundaries, the key to weighting pixels is to design a proper algorithm to measure these distances. Here, the distance transform is one option. In general, the distance transform is an operator applied only to binary images. It converts a binary image into a distance map in which each foreground pixel's minimum distance from the background is kept. First of all, the definition of the distance transform function $D(\cdot)$ is given as follows.

Definition 2. For a binary image $I$, let $I_o = \{p \mid f(p) = 1, p \in I\}$ be the object region, and $I_b = \{q \mid f(q) = 0, q \in I\}$ be the background region. For each object (background) pixel $p$ ($q$), its distance transform coefficient is defined as $D_o(p) = \min_{q \in I_b} \{d(p, q)\}$ ($D_b(q) = \min_{p \in I_o} \{d(q, p)\}$).

In this definition, $f(\cdot)$ denotes the gray value of the pixel “$\cdot$”, and $d(x, y)$ denotes the city block distance. $D_o(p)$ ($D_b(q)$) is the distance from the nearest background (object) pixel. An illustration of the city block distance transform on the object pixels is shown in Fig. 3. Fig. 3(a) is an original binary image, in which “1” denotes a pixel in object regions and “0” denotes a background pixel. Taking the pixel “1” (denoted as r) in the
Fig. 3. City block distance transform: (a) original binary image, (b) example of distance transform coefficient calculation, (c) final distance transform map. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
red box in Fig. 3(b) as an example, it can be observed that its distances from the nearest “0” in the four directions (up, down, left and right) are 2, 3, 2 and 4, respectively. Therefore, the minimum distance is 2, i.e. $D_o(r) = 2$, and the corresponding value in the distance transform matrix is 2. After processing the other non-zero pixels in the same manner, the final distance transform map $D_o$ for the object pixels is obtained, which is shown in Fig. 3(c). If we let $I \leftarrow 1 - I$ and measure the distances from the nearest “0” for all the pixels labeled “1”, the final distance transform map $D_b$ for the background pixels can be obtained as well. Each of the distance transform coefficient matrices $D_o$ and $D_b$ has a maximum value, denoted as $D_{o\_max}$ and $D_{b\_max}$, respectively. The former is the maximum over all object pixels of their minimum distances from the background, and it is defined as:

$$ D_{o\_max} = \max \{ D_o(p) \mid p \in I_o \} \tag{5} $$
Similarly, the latter one represents the maximum value of all the background pixels' minimum distances from the objects, and it is defined as:

$$ D_{b\_max} = \max \{ D_b(q) \mid q \in I_b \} \tag{6} $$
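The city block distance transform of Definition 2 and the maxima in Eqs. (5) and (6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `cityblock_distance_transform`, the two-pass scan and the 5 × 5 toy image are assumptions introduced here, and the sketch assumes that the image contains at least one “0” pixel.

```python
def cityblock_distance_transform(binary):
    """For every pixel equal to 1, return the city block distance to the
    nearest 0 pixel; pixels equal to 0 map to 0 (hypothetical helper)."""
    rows, cols = len(binary), len(binary[0])
    big = rows + cols  # exceeds any city block distance inside the grid
    dist = [[0 if binary[i][j] == 0 else big for j in range(cols)]
            for i in range(rows)]
    # Forward pass: propagate distances from the top and left neighbours.
    for i in range(rows):
        for j in range(cols):
            if dist[i][j]:
                if i > 0:
                    dist[i][j] = min(dist[i][j], dist[i - 1][j] + 1)
                if j > 0:
                    dist[i][j] = min(dist[i][j], dist[i][j - 1] + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for i in reversed(range(rows)):
        for j in reversed(range(cols)):
            if dist[i][j]:
                if i < rows - 1:
                    dist[i][j] = min(dist[i][j], dist[i + 1][j] + 1)
                if j < cols - 1:
                    dist[i][j] = min(dist[i][j], dist[i][j + 1] + 1)
    return dist


# Toy binary image (assumed for illustration): a 3 x 3 object on a 5 x 5 grid.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

d_o = cityblock_distance_transform(image)                 # object map D_o
complement = [[1 - v for v in row] for row in image]      # I <- 1 - I
d_b = cityblock_distance_transform(complement)            # background map D_b

d_o_max = max(max(row) for row in d_o)  # Eq. (5)
d_b_max = max(max(row) for row in d_b)  # Eq. (6)
print(d_o_max, d_b_max)  # prints "2 2" for this toy image
```

In this toy case the object centre is the only pixel with $D_o = 2$, and the four image corners are the only background pixels with $D_b = 2$.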
Intuitively, for an object pixel $p$, if $p = \arg(D_{o\_max})$, i.e. $D_o(p) = D_{o\_max}$, $p$ exerts the smallest influence on the segmentation performance among the object pixels. Consequently, $p$ should be given the smallest weight in the evaluation process compared with the other object pixels. Similarly, for a background pixel $q$, if $q = \arg(D_{b\_max})$, $q$ should be given the smallest weight in the evaluation process among all the background pixels. For any pixel $x_{ij} \in I$, its weight value is defined as:

$$ w(x_{ij}) = \begin{cases} wtp(x_{ij}), & \text{if } x_{ij} \in TP \\ wfp(x_{ij}), & \text{if } x_{ij} \in FP \\ wtn(x_{ij}), & \text{if } x_{ij} \in TN \\ wfn(x_{ij}), & \text{if } x_{ij} \in FN \end{cases} \tag{7} $$

where $1 \le i \le M$, $1 \le j \le N$, $wtp(x_{ij}) = D_{o\_max} - D_o(x_{ij})$, $wfp(x_{ij}) = D_{b\_max} - D_b(x_{ij})$, $wtn(x_{ij}) = D_{b\_max} - D_b(x_{ij})$ and $wfn(x_{ij}) = D_{o\_max} - D_o(x_{ij})$. Then, the weighted sum of each pixel class (TP, FP, TN and FN) can be obtained by

$$ wTP = \sum_{1 \le i \le M,\ 1 \le j \le N} wtp(x_{ij}) \tag{8} $$

$$ wFP = \sum_{1 \le i \le M,\ 1 \le j \le N} wfp(x_{ij}) \tag{9} $$

$$ wTN = \sum_{1 \le i \le M,\ 1 \le j \le N} wtn(x_{ij}) \tag{10} $$

$$ wFN = \sum_{1 \le i \le M,\ 1 \le j \le N} wfn(x_{ij}) \tag{11} $$

In each sum, only the pixels that belong to the corresponding class contribute.
Based on the four equations, the weighted TP rate (wTPR) and weighted FP rate (wFPR) can be obtained by Eqs. (1) and (2).
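As a rough illustration of Eqs. (1), (2) and (7)–(11), the following Python sketch computes the weighted rates for a toy example. Several points are assumptions made here rather than statements from the paper: the distance maps are computed on the ground truth $R$ (following the remark in Section 2 that the weight relates to the distance from the nearest edge in the ground truth), the nearest-pixel search is a brute-force loop chosen for brevity, the function name `weighted_rates` and the 5 × 5 images are hypothetical, and both denominators are assumed to be non-zero.

```python
def weighted_rates(seg, ref):
    """Return (wFPR, wTPR) for a binary result `seg` and ground truth `ref`.
    Assumption: distance maps are taken on `ref`; both classes are non-empty."""
    rows, cols = len(ref), len(ref[0])
    pixels = [(i, j) for i in range(rows) for j in range(cols)]
    obj = [p for p in pixels if ref[p[0]][p[1]] == 1]
    bkg = [p for p in pixels if ref[p[0]][p[1]] == 0]

    def nearest(p, others):
        # City block distance from p to the closest pixel in `others`.
        return min(abs(p[0] - q[0]) + abs(p[1] - q[1]) for q in others)

    d_o = {p: nearest(p, bkg) for p in obj}   # D_o on object pixels
    d_b = {q: nearest(q, obj) for q in bkg}   # D_b on background pixels
    d_o_max = max(d_o.values())               # Eq. (5)
    d_b_max = max(d_b.values())               # Eq. (6)

    w = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}  # Eqs. (8)-(11)
    for (i, j) in pixels:
        if ref[i][j] == 1:
            weight = d_o_max - d_o[(i, j)]    # wtp / wfn in Eq. (7)
            w["TP" if seg[i][j] == 1 else "FN"] += weight
        else:
            weight = d_b_max - d_b[(i, j)]    # wfp / wtn in Eq. (7)
            w["FP" if seg[i][j] == 1 else "TN"] += weight

    wtpr = w["TP"] / (w["TP"] + w["FN"])      # Eq. (1)
    wfpr = w["FP"] / (w["FP"] + w["TN"])      # Eq. (2)
    return wfpr, wtpr


# Toy ground truth (assumed): a 3 x 3 object on a 5 x 5 grid.
ref = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
seg_interior = [row[:] for row in ref]
seg_interior[2][2] = 0   # one missed pixel in the object interior
seg_edge = [row[:] for row in ref]
seg_edge[1][1] = 0       # one missed pixel on the object edge

print(weighted_rates(seg_interior, ref))  # (0.0, 1.0)
print(weighted_rates(seg_edge, ref))      # (0.0, 0.875)
```

Both toy segmentations contain exactly one misclassified pixel, yet only the error on the object edge lowers wTPR, which mirrors the motivation given in Section 2. The example also shows that, under these definitions, a pixel attaining the maximum distance receives weight 0.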
3.3. Rating in weighted-ROC graph

In the weighted-ROC graph, whose x-axis is labeled by wFPR and whose y-axis is labeled by wTPR, each point (wFPR, wTPR) corresponds to a segmentation result. A better segmentation should have a lower wFPR and a higher wTPR, so the best segmentation corresponds to the point (0, 1) in the weighted-ROC graph. By contrast, (1, 0) is the worst point, which means that all the pixels are incorrectly segmented into the opposite region. One point in the weighted-ROC graph is better than another if and only if it is closer to $p(0, 1)$. Formally, a simple metric is designed to measure the segmentation quantitatively. Let $a(x, y)$ denote a point in the weighted-ROC graph (see Fig. 4); its distance from the perfect point $p(0, 1)$ is

$$ d(a, p) = \sqrt{(x - 0)^2 + (y - 1)^2} = \sqrt{x^2 + (y - 1)^2} \tag{12} $$

Clearly, the performance of point $a$ decreases as $d(a, p)$ increases. The score of point $a$ is defined as:

$$ score(a) = \sqrt{2} - d(a, p) \tag{13} $$

where $\sqrt{2}$ is the distance from the worst point to the optimal point $p$, which is illustrated by a red line in Fig. 4. To simplify calculation and analysis, Eq. (13) is normalized as

$$ S(a) = \frac{score(a)}{\sqrt{2}} = \frac{\sqrt{2} - d(a, p)}{\sqrt{2}} = 1 - \frac{d(a, p)}{\sqrt{2}} \tag{14} $$

Following the rationale of the weighted-ROC graph, it is clear that $S$ is zero for the “worst segmentation”, in which all the pixels are assigned to the wrong regions; conversely, $S = 1$ corresponds to the “perfect segmentation”, in which the automatic segmentation is identical to the ground truth.
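The normalized score of Eqs. (12)–(14) is simple enough to check numerically. The following minimal Python sketch assumes a hypothetical function name `weighted_roc_score` and takes an already computed point (wFPR, wTPR) as input.

```python
import math


def weighted_roc_score(wfpr, wtpr):
    """Normalized score S(a) of the point a = (wfpr, wtpr), per Eqs. (12)-(14)."""
    d = math.hypot(wfpr - 0.0, wtpr - 1.0)  # distance to the perfect point (0, 1)
    return 1.0 - d / math.sqrt(2.0)         # Eq. (14)


perfect = weighted_roc_score(0.0, 1.0)  # perfect segmentation, S = 1
worst = weighted_roc_score(1.0, 0.0)    # worst segmentation, S is (numerically) 0
sample = weighted_roc_score(0.0, 0.875) # hypothetical point, S is roughly 0.9116
print(perfect, worst, sample)
```

Because the score depends only on the distance to (0, 1), points on the same circle around the perfect point receive the same score.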
Fig. 4. Weighted-ROC graph. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
4. Experimental results and analysis

In this section, the effectiveness of the proposed metric is demonstrated on object images and text images by comparison with three other common metrics, namely the traditional ROC graph, ME and JS. To make the comparison convenient, 1-ME is used in place of ME in the comparison experiments. In addition, to verify the advantages of our metric, small target images are employed to test its robustness against region imbalance. To make the experiments more convincing, we also run the proposed method on the BSDS dataset. Four image segmentation algorithms are employed to obtain different segmentation results: a global thresholding segmentation method, a direct block-based segmentation method, Long's method [24] and Chou's method [25]. The three parameters (n, l and h) of Long's method are assigned different values in different experiments; the values are listed along with the segmented images.

4.1. Evaluation on object images

The proposed metric has been compared with the others on several object image sets. For the sake of discussion, three of them are selected in this subsection to show the comparison results. Without loss of generality, the chosen image sets cover one uniform illumination image and two non-uniform illumination images, although there are no explicit reports on whether non-uniform illumination affects the evaluation of segmentation. The two non-uniform illumination image sets are “polygon” and “headphone”, shown in Figs. 5 and 6, respectively, together with their ground truth and the automatic segmentations produced by the different algorithms. Fig. 5(b) is the ground truth, in which the white region represents the object and the black region represents the background. From the subjective aspect, Fig. 5(e) is very close to the ground truth, which makes it better than the other three segmented images (Fig. 5(c), (d) and (f))
overwhelmingly. Fig. 5(c) is a classical over-segmentation case, in which pixels in the background region are identified as object pixels; by contrast, Fig. 5(f) shows an under-segmentation case, in which object pixels are classified as background. In Fig. 5(d), over- and under-segmentation co-exist. Table 1 lists the scores and ranks that the four measures give to the four segmented images. From the table, it can be observed that all the measures rank Fig. 5(e) first and Fig. 5(c) last; for Fig. 5(d) and (f), both the weighted-ROC and the traditional ROC rank Fig. 5(d) ahead of Fig. 5(f), while 1-ME and JS give the opposite order. We now return to Fig. 5 to explore the reason for this conflict: in Fig. 5(d), all the incorrectly segmented pixels are distributed in the interior of the object or the background, so they do not affect contour detection of the object, whereas in Fig. 5(f) the contour of the object has been changed considerably. Considering that image segmentation often acts as a pre-processing step for other image processing tasks, the segmentation result in Fig. 5(f) is less acceptable than that in Fig. 5(d). Therefore, the proposed weighted-ROC is not only effective but also more reasonable.

Consider another image set, shown in Fig. 6. In the ground truth, Fig. 6(b), the black region indicates the object and the white region the background. It can be clearly noticed that Fig. 6(c) and (e) are very close to the ground truth, so they look far better than the other two segmented images (Fig. 6(d) and (f)) but are difficult to distinguish from each other. The evaluation comparison of the proposed measure with the other three is shown in Table 2. Interestingly, all the measures give the same order on this image set. For Fig. 6(c) and (e), all the measures agree on the relative order of the two images and separate them by only a narrow margin.

Fig. 7 shows a uniform illumination image set. In order to compare the measures clearly, some automatically segmented images of similar quality are chosen here, such as Fig. 7(c), (e), and (f). From the evaluation comparison shown in Table 3, it can be seen that all the measures rank Fig. 7(e) first and Fig. 7(d) last. For Fig. 7(c) and (f), the order given by the weighted-ROC is more reasonable: in Fig. 7(c), the character “A” on the pen is not segmented satisfactorily, which may affect further image processing.

Summing up, in two out of three image sets the proposed measure gives a ranking that differs from those of other measures (from 1-ME and JS on “polygon”, and from traditional ROC, 1-ME and JS on “pen&eraser”), while on the remaining image set all the measures give the same ranking. That is, although the measures all gauge segmentation performance, they emphasize different aspects and will therefore sometimes favor different segmentation models.
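For comparison, the pixel-counting baselines can be sketched as follows. The formulas are assumptions derived from the verbal descriptions in the Introduction (ME as the fraction of incorrectly classified pixels, JS as the ratio of the intersection to the union of the object regions); the function names and the 5 × 5 toy images are hypothetical. The sketch illustrates why such measures cannot separate two results that contain the same number of errors at different positions.

```python
def confusion_counts(seg, ref):
    """Count TP, FP, TN and FN pixels of `seg` against the ground truth `ref`."""
    tp = fp = tn = fn = 0
    for seg_row, ref_row in zip(seg, ref):
        for s, r in zip(seg_row, ref_row):
            if s == 1 and r == 1:
                tp += 1
            elif s == 1 and r == 0:
                fp += 1
            elif s == 0 and r == 0:
                tn += 1
            else:
                fn += 1
    return tp, fp, tn, fn


def one_minus_me(seg, ref):
    """1 - ME, with ME taken as the fraction of misclassified pixels."""
    tp, fp, tn, fn = confusion_counts(seg, ref)
    return 1.0 - (fp + fn) / (tp + fp + tn + fn)


def jaccard(seg, ref):
    """JS, taken as |intersection| / |union| of the two object regions."""
    tp, fp, tn, fn = confusion_counts(seg, ref)
    return tp / (tp + fp + fn)


# Toy ground truth (assumed): a 3 x 3 object on a 5 x 5 grid.
ref = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
seg_interior = [row[:] for row in ref]
seg_interior[2][2] = 0   # error in the object interior
seg_edge = [row[:] for row in ref]
seg_edge[1][1] = 0       # error on the object edge

# Both results have one false negative, so both baselines score them identically.
print(one_minus_me(seg_interior, ref), one_minus_me(seg_edge, ref))
print(jaccard(seg_interior, ref), jaccard(seg_edge, ref))
```

Any measure built only from the four pixel counts behaves this way, since the counts carry no information about where the errors lie.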
Fig. 5. The “polygon” image segmentation: (a) original image; (b) ground truth; (c) global segmentation; (d) direct block-based segmentation; (e) Long's method (n = 6, l = 17, h = 21); (f) Chou's method.

Fig. 6. The “headphone” image segmentation: (a) original image; (b) ground truth; (c) global segmentation; (d) direct block-based segmentation; (e) Long's method (n = 5, l = 6, h = 14); (f) Chou's method.

4.2. Evaluation on text images

Images have become a useful tool for storing, conveying and representing information. In the real world, the problem of identifying text in images is often encountered. In general, image segmentation is employed to extract the text pixels from the images before a recognition algorithm is applied. Hence, this subsection focuses on whether the weighted-ROC works when gauging the segmentation performance on text images. Figs. 8 and 9 show two test sets of text images. The ground truth image and four automatically segmented images are also provided. The objective evaluation of the two image sets is shown in Table 4. For Fig. 8, the order given by our proposed metric is the same as the order given by the other three measures. Fig. 8(e) is the best segmentation result, followed by Fig. 8(d), (f) and (c). This order is also consistent with the subjective evaluation by the human visual system. In Fig. 9, it can be clearly seen that Fig. 9(e) is very close to the ground truth and should rank well ahead of the other segmented images (Fig. 9(c), (d) and (f)); conversely, Fig. 9(c) is the worst segmentation result, because over half of the text in this image fails to be segmented.
From Table 4, it can be noticed that all the measures rank Fig. 9(e) first and Fig. 9(c) last. However, for Fig. 9(d) and (f), the weighted-ROC gives a different order from the other three measures. In Fig. 9(f), the two over-segmentation blocks are very close to the boundary of the segmented text, especially the lower one, which covers the edges of the characters “o”, “n”, “p” and “r”; thus, character recognition may be influenced to some extent. Fig. 9(d) does not have this problem, although it contains three over-segmentation blocks. The weighted-ROC gives a more reasonable order: Fig. 9(d) is ranked ahead of Fig. 9(f).
Table 1. Evaluation comparison on the “polygon” image set. Entries are scores (ranks).

Segmented image   Weighted-ROC   ROC          1-ME         JS
Fig. 5(c)         0.6938 (4)     0.6770 (4)   0.6683 (4)   0.4664 (4)
Fig. 5(d)         0.7959 (2)     0.7438 (2)   0.7265 (3)   0.4825 (3)
Fig. 5(e)         0.9987 (1)     0.9991 (1)   0.9992 (1)   0.9974 (1)
Fig. 5(f)         0.7731 (3)     0.6877 (3)   0.8634 (2)   0.5576 (2)

Table 2. Evaluation comparison on the “headphone” image set. Entries are scores (ranks).

Segmented image   Weighted-ROC   ROC          1-ME         JS
Fig. 6(c)         0.9804 (2)     0.9854 (2)   0.9828 (2)   0.9789 (2)
Fig. 6(d)         0.9383 (4)     0.9247 (4)   0.9113 (4)   0.8918 (4)
Fig. 6(e)         0.9827 (1)     0.9866 (1)   0.9927 (1)   0.9911 (1)
Fig. 6(f)         0.9607 (3)     0.9665 (3)   0.9603 (3)   0.9515 (3)

Table 3. Evaluation comparison on the “pen&eraser” image set. Entries are scores (ranks).

Segmented image   Weighted-ROC   ROC          1-ME         JS
Fig. 7(c)         0.9899 (3)     0.9930 (2)   0.9917 (2)   0.9292 (2)
Fig. 7(d)         0.8646 (4)     0.8506 (4)   0.8116 (4)   0.3670 (4)
Fig. 7(e)         0.9953 (1)     0.9964 (1)   0.9985 (1)   0.9861 (1)
Fig. 7(f)         0.9940 (2)     0.8678 (3)   0.9767 (3)   0.8086 (3)

Fig. 7. The “pen&eraser” image segmentation: (a) original image; (b) ground truth; (c) global segmentation; (d) direct block-based segmentation; (e) Long's method (n = 6, l = 20, h = 25); (f) Chou's method.

Fig. 8. The first text image segmentation: (a) original image; (b) ground truth; (c) global segmentation; (d) direct block-based segmentation; (e) Long's method (n = 6, l = 3, h = 12); (f) Chou's method.

Fig. 9. The second text image segmentation: (a) original image; (b) ground truth; (c) global segmentation; (d) direct block-based segmentation; (e) Long's method (n = 5, l = 4, h = 5); (f) Chou's method.

4.3. The case of region imbalance

In some cases, small target images need to be segmented, for example, infrared small target images with a sky or sea-sky background, or tumor and lesion areas in medical images. To verify the advantages of our metric, small target image sets are employed to test its robustness against region imbalance. For the sake of discussion, two of them are provided in this subsection to illustrate the comparison results. Here, the segmentation results are obtained by Long's method with different parameter settings. Both small target image sets are “rice” images, shown in Figs. 10 and 11, respectively. Their objective evaluation results are shown in Table 5. For Fig. 10, there is no doubt that Fig. 10(e) should rank first, because it is very close to the ground truth. However, the other three test images (Fig. 10(c), (d) and (f)) are difficult to distinguish from each other. In Table 5, the weighted-ROC gives Fig. 10(f) the lowest score, while the other three measures rank Fig. 10(c) last. Let us return to Fig. 10 to find the cause of this conflict: in Fig. 10(f), nearly half of the rice is incorrectly segmented as background,
though these pixels account for only a small percentage of the whole image. The shape of the “rice” has been changed so much that it seriously affects target recognition; in Fig. 10(c), by contrast, the “rice” is segmented successfully and the misclassified pixels are mainly distributed along the edge of the background, so they do not affect contour detection of the object. So the segmentation result in Fig. 10(c) is more acceptable than that in Fig. 10(f).

Table 4. Evaluation comparison on two text image sets. Entries are scores (ranks).

Segmented image   Weighted-ROC   ROC          1-ME         JS
Fig. 8(c)         0.7192 (4)     0.7229 (4)   0.6862 (4)   0.6057 (4)
Fig. 8(d)         0.9593 (2)     0.9661 (2)   0.9698 (2)   0.9622 (2)
Fig. 8(e)         0.9612 (1)     0.9722 (1)   0.9913 (1)   0.9891 (1)
Fig. 8(f)         0.8486 (3)     0.8364 (3)   0.8119 (3)   0.7645 (3)
Fig. 9(c)         0.6944 (4)     0.7247 (4)   0.6700 (4)   0.6098 (4)
Fig. 9(d)         0.9203 (2)     0.9261 (3)   0.9251 (3)   0.9124 (3)
Fig. 9(e)         0.9285 (1)     0.9491 (1)   0.9879 (1)   0.9859 (1)
Fig. 9(f)         0.9144 (3)     0.9319 (2)   0.9369 (2)   0.9262 (2)

Fig. 10. The first rice image segmentation: (a) original image; (b) ground truth; (c) n = 5, l = 21, h = 21; (d) n = 5, l = 23, h = 23; (e) n = 5, l = 23, h = 24; (f) n = 5, l = 25, h = 25.

From Fig. 11, it can be noted that Fig. 11(d) is very close to the ground truth. In Table 5, the four measures rank Fig. 11(d) first and Fig. 11(f) last. For Fig. 11(c) and (e), the weighted-ROC ranks the former ahead of the latter, while the other three measures give the opposite order. As in the first rice image set, the lower rice grain in Fig. 11(e) fails to be segmented, whereas all the segmented rice in Fig. 11(c) is very close to the ground truth, although there are some over-segmentation pixels around the edge of the background. So it is reasonable to give Fig. 11(c) a higher score than Fig. 11(e).

Summing up, the proposed measure is strongly robust against region imbalance. Compared with traditional ROC, 1-ME and JS, the weighted-ROC gives more reasonable evaluation results on skewed data; in this respect it is superior to the traditional evaluation metrics.

4.4. The relationship of weighted-ROC with other measures

From the comparison above, it can be observed that the ranks that the proposed weighted-ROC metric gives to the segmented images may be the same as those given by other measures in some cases, while in other cases, the
ranks they give may be quite different. It would be interesting to find the underlying relationship between the weighted-ROC and the other measures. As these measures have been employed to gauge several image sets, the evaluation scores can be used here to estimate the relationship between the measures. This may seem artificial, but it does provide an insight into the problem in an easy way.

Fig. 11. The second rice image segmentation: (a) original image; (b) ground truth; (c) n = 5, l = 21, h = 21; (d) n = 5, l = 21, h = 22; (e) n = 5, l = 23, h = 24; (f) n = 5, l = 24, h = 24.

Table 5. Evaluation comparison on two small target image sets. Entries are scores (ranks).

Segmented image   Weighted-ROC   ROC          1-ME         JS
Fig. 10(c)        0.9558 (3)     0.4202 (4)   0.9172 (4)   0.1073 (4)
Fig. 10(d)        0.9771 (2)     0.5450 (3)   0.9690 (3)   0.2429 (3)
Fig. 10(e)        0.9892 (1)     0.8554 (1)   0.9974 (1)   0.7950 (1)
Fig. 10(f)        0.7448 (4)     0.6805 (2)   0.9945 (2)   0.5478 (2)
Fig. 11(c)        0.9719 (2)     0.7502 (3)   0.9599 (3)   0.4122 (3)
Fig. 11(d)        0.9785 (1)     0.7884 (1)   0.9880 (1)   0.7007 (1)
Fig. 11(e)        0.9304 (3)     0.7608 (2)   0.9865 (2)   0.6617 (2)
Fig. 11(f)        0.8894 (4)     0.6524 (4)   0.8730 (4)   0.1740 (4)

Empirical relationships of weighted-ROC with other measures are shown in Fig. 12. Take Fig. 12(a) as an example: in the two-dimensional graph, weighted-ROC is plotted on the x-axis and ROC on the y-axis. Each mark on the graph corresponds to a segmented image. It can be noted that a fitting function of the two measures can hardly be obtained if all the instances are taken into account. One alternative is to remove some noise. The experimental results above remind us that when evaluating region imbalanced images, weighted-ROC and ROC tend to give different ranks. So the balanced images are marked with “*” and the imbalanced images are marked with “○”. The distribution of balanced images suggests that the weighted-ROC is linearly proportional to the ROC. The following conclusions can be drawn: when evaluating region balanced images, weighted-ROC and ROC give very similar results, whereas when evaluating region imbalanced images, there is no explicit relationship between their evaluation results. From Fig. 12(b), it is clear that this conclusion also holds for the relationship between weighted-ROC and 1-ME.

In Fig. 12, the norm of residuals (NR), a measure that reflects the difference between all the instances and their fitting functions, is also listed. A lower NR means a better function fit. The NR values of weighted-ROC vs. ROC, weighted-ROC vs. 1-ME, and weighted-ROC vs. JS are 0.1628, 0.15441, and 0.52325, respectively. Hence, the linear relationship between weighted-ROC and JS is weaker than that between weighted-ROC and the other two measures.

Summing up, when evaluating region balanced images, the measures weighted-ROC, ROC, and 1-ME may give similar evaluation results. However, when evaluating region imbalanced images, weighted-ROC may give different results from the other measures. Hence, in this case, one needs to know whether the contours of the targets are what humans value. If so, weighted-ROC may be a better alternative because of its weighting strategy. If not, the other measures do not necessarily work worse than the weighted-ROC.
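The kind of linear fit and norm of residuals discussed above can be sketched as follows. This is a sketch under stated assumptions: NR is taken to be the Euclidean norm of the residuals of an ordinary least-squares line, the function name `linear_fit_residual_norm` is hypothetical, and the four “polygon” scores from Table 1 serve only as sample input. Since Fig. 12 is built from more images, the NR values reported in the text are not expected to be reproduced by this toy input.

```python
import math


def linear_fit_residual_norm(x, y):
    """Fit y = slope * x + intercept by least squares and return
    (slope, intercept, nr), where nr is the Euclidean norm of the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    nr = math.sqrt(sum(r * r for r in residuals))
    return slope, intercept, nr


# Sample input: weighted-ROC and ROC scores of Fig. 5(c)-(f) from Table 1.
weighted_roc = [0.6938, 0.7959, 0.9987, 0.7731]
roc = [0.6770, 0.7438, 0.9991, 0.6877]

slope, intercept, nr = linear_fit_residual_norm(weighted_roc, roc)
print(slope, intercept, nr)
```

A smaller `nr` indicates that the points lie closer to the fitted line, which is the sense in which NR is compared across the three panels of Fig. 12.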
4.5. Evaluation on BSDS dataset
Currently, the performance of state-of-the-art methods is usually verified on the BSDS dataset, which includes test images and the corresponding ground truth. Therefore, we also run the proposed metric on this dataset to make the experimental results more convincing. Since the proposed metric, ME, and JS are designed for binary classification, 28 test images with binary ground truth are chosen from the dataset for the experiment. Furthermore, to demonstrate that the proposed metric is more consistent with the human visual system, three experts from the image segmentation field were asked to evaluate the segmentation results subjectively, and their evaluation results are taken as the benchmark. Here, the Spearman correlation coefficient is employed to estimate the correlation between the benchmark and the scores given by the different metrics. A higher correlation indicates that the ranks given by a metric are more in line with the evaluation by the human visual system. For two one-dimensional arrays X and Y of the same size N, the raw scores $X_i$ and $Y_i$ are converted to ranks $x_i$ and $y_i$, respectively. The Spearman correlation coefficient
Table 6 Evaluation comparison on the “plane” (ID = 3063) image set. Entries are scores (ranks).

| Segmented image | Weighted-ROC | ROC | 1-ME | JS | Benchmark of ranks |
|---|---|---|---|---|---|
| Fig. 13(c) | 0.9446(1) | 0.8558(1) | 0.9651(1) | 0.7947(1) | 1 |
| Fig. 13(d) | 0.9412(2) | 0.8518(3) | 0.9636(3) | 0.7868(3) | 2 |
| Fig. 13(e) | 0.9375(3) | 0.8535(2) | 0.9642(2) | 0.7891(2) | 3 |
| Fig. 13(f) | 0.7169(4) | 0.6282(4) | 0.6049(4) | 0.2230(4) | 4 |
| Spearman (ρ) | 1 | 0.8 | 0.8 | 0.8 | – |
Fig. 12. Empirical relationship of weighted-ROC with other measures: (a) weighted-ROC vs. ROC; (b) weighted-ROC vs. 1-ME; (c) weighted-ROC vs. JS.
between them is defined as:

$$\rho = 1 - \frac{6\sum d_i^{2}}{N\left(N^{2}-1\right)}, \quad (1 \le i \le N) \tag{15}$$
where $d_i = x_i - y_i$ denotes the difference between the ranks $x_i$ and $y_i$. The value of $|\rho|$ ranges from 0 to 1; the larger $|\rho|$ is, the stronger the correlation between X and Y.
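As a minimal sketch of Eq. (15), the Python snippet below converts raw scores to ranks and evaluates ρ against the expert benchmark, using the Fig. 13(c)–(f) scores reported in Table 6. The function names are hypothetical, and the ranking helper assumes that a higher score is better and that there are no tied scores, which holds for these values but is not a general treatment of ties.

```python
def ranks_descending(scores):
    """Rank scores so that the highest score gets rank 1 (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rho(x_ranks, y_ranks):
    """Spearman coefficient from two rank lists, following Eq. (15)."""
    n = len(x_ranks)
    if n != len(y_ranks) or n < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


if __name__ == "__main__":
    benchmark = [1, 2, 3, 4]  # expert ranks for Fig. 13(c)-(f)
    weighted_roc = [0.9446, 0.9412, 0.9375, 0.7169]
    roc = [0.8558, 0.8518, 0.8535, 0.6282]
    print(spearman_rho(ranks_descending(weighted_roc), benchmark))
    print(spearman_rho(ranks_descending(roc), benchmark))
```

For the ROC scores, the ranks are (1, 3, 2, 4), so the squared rank differences sum to 2 and ρ = 1 − 12/60 = 0.8, in agreement with Table 6; the weighted-ROC ranks match the benchmark exactly and give ρ = 1.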
Fig. 13 shows an image set named “plane” (ID = 3063). In the figure, there is no doubt that Fig. 13(f) should rank last, because it contains too many misclassified pixels. Compared with Fig. 13(d) and (e), Fig. 13(c) is the most similar to the ground truth in Fig. 13(b), so the first place is given to Fig. 13(c). Fig. 13(d) and (e) are difficult to distinguish from each other. However, a closer look shows that, although there are two small over-segmentation regions in the upper left corner of Fig. 13(d), the propeller region in Fig. 13(d) is clearly better segmented than in Fig. 13(e). The objective evaluation results and the benchmark for the image set in Fig. 13 are shown in Table 6. From the table, it can be noted that the order given by the weighted-ROC based metric agrees with that given by the experts; the Spearman correlation coefficient ρ between them is 1. Fig. 14 shows the average value of the Spearman correlation coefficients between each evaluation measure and the benchmark over all 28 test image sets. The average for the weighted-ROC based measure is the highest, which means that the ranks given by the proposed evaluation method are closer to the benchmark than those of the other evaluation methods. Therefore, the proposed metric is more in line with the human visual system.

Fig. 13. The “plane” (ID = 3063) image segmentation: (a) original image; (b) ground truth; (c) global segmentation; (d) direct block-based segmentation; (e) Long's method (n = 5, l = 14, h = 18); (f) Chou's method.

Fig. 14. Average values of Spearman correlation coefficients between the evaluation measures and the benchmark.

5. Conclusion
In this paper, the problem of automatically gauging the performance of image segmentation algorithms is investigated. The proposed objective measure is based on the weighted-ROC graph. One attractive feature is that the importance of each pixel is taken into account in the evaluation process: different pixels are given different importance based on their spatial information. In addition, the proposed measure is robust to region imbalance. To demonstrate the utility of the weighted-ROC graph based metric, a detailed comparison is performed with three other measures: ROC, 1-ME, and JS. The test image sets cover object and text images. To verify the advantages of our metric, small target images are also employed to test its robustness against region imbalance. The experimental results show that the assessments given by weighted-ROC are closer to the subjective evaluation by the human visual system. In addition, the measure favors segmentations in which the contours of the targets are well detected. The relationship of weighted-ROC with the other measures is also explored empirically. The results suggest that weighted-ROC, ROC, and 1-ME correlate strongly for region-balanced images; in other cases, their correlations are weak. There is no explicit relationship between weighted-ROC and JS. To make the experiments more convincing, we run the proposed method on the BSDS dataset. The experimental results indicate that the proposed metric is more in line with the human visual system. Further work is needed to refine the weighting scheme and thereby improve the performance of weighted-ROC. How to extend the measure to the segmentation of two or more regions is also a direction for future research.
Acknowledgments
The authors would like to express their gratitude to the editors and anonymous reviewers for their comments and constructive suggestions. This work is jointly supported by the National Natural Science Foundation of China, China (61305046), and the Natural Science Foundation of Jilin Province, China (20130522117JH, 20140101193JC).