Highlights

• We add an adversarial model to image contour detection to form high-level edge maps.
• We employ all convolutional layers in the encoder stage to improve convergence rates.
• We randomly crop 224 × 224 regions and then rescale them to improve fitting ability.
• We evaluated the weight of the adversarial loss during training and found that an increasing weight leads to the omission of local details.
• Our proposed solution outperforms RCF, demonstrating the benefits of adversarial training.
ContourGAN: Image Contour Detection with Generative Adversarial Network

Hongju Yang a,∗, Yao Li a, Xuefeng Yan b, Fuyuan Cao a

a Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China
b School of Mechanical Engineering, Tianjin University of Technology and Education, Tianjin 300222, China
Abstract

We propose a convolutional encoder–decoder framework to extract image contours, supported by a generative adversarial network that improves the contour quality. Traditional image-to-image models only consider the loss between the prediction and the ground truth, neglecting the similarity between the data distributions of the outcomes and the ground truth. Based on this observation, the proposed generative adversarial network aims to increase the detection accuracy. The resulting method contains two models, namely an encoder–decoder model, whose weights are updated using a binary cross-entropy loss with fine-tuning from the VGG16 pre-trained model, and a discriminator network that takes ground truths and predicted contours as input for discrimination. We evaluate the method on the Berkeley Segmentation Data Set and Benchmarks 500 and the NYUD dataset, achieving state-of-the-art performance with ODS F-measures of 0.810 and 0.715, respectively.

Keywords: contour detection; deep learning; generative adversarial networks; convolutional neural networks
∗ Corresponding author. Email address: [email protected] (Hongju Yang)
1. Introduction

Image contour detection is essential in various computer vision applications, including object detection [1–3], image retrieval [4, 5], and image segmentation [6–8]. Contour detection extracts visually salient edges and object boundaries from linear segments in images, thus resembling image enhancement [9]. Conventional contour extraction usually comprises two stages. First, manually designed low-level features are determined from changes in local brightness, colors, gradients, and textures. Second, edge pixels are identified in the image. Although such contour extraction methods improve edge detection, handcrafted feature extraction is a subjective and sometimes unreliable process. Moreover, the boundaries and edges of objects often carry meaningful semantic information, and low-level features may impede the representation of high-level semantic boundaries. To address these problems, approaches such as structured edge detection [10] and the gPb algorithm [11] capture global representations of images.

Convolutional neural networks (CNNs) have demonstrated high performance in image classification [12–14], object detection [3, 15, 16], semantic segmentation [17], and other applications. Moreover, their ability to determine meaningful features from images is increasing the use of CNNs in edge detection. For instance, contour detection based on CNNs includes methods such as DeepEdge [18], N4-Fields [19], DeepContour [20], the fully convolutional encoder–decoder (FCED) [21], HED [22], and RCF [23]. Accordingly, we propose an adversarial network-based contour detection method. CNN-based methods have enabled substantial breakthroughs in computer vision and deliver strong interpretability. As reported in [23, 24], feature maps at different layers capture diverse information, from local textures to global object outlines. Unlike previous methods that only use the pooling layers for edge detection [21, 22], we employ all the convolutional layers of the encoder to extract edge information from enhanced features.

In the proposed method, a generative adversarial network (GAN) [25] composed of a generator and a discriminator supports edge detection. Specifically, the
generator is trained to produce image contours, and the discriminator is trained to detect synthetic images from the generator. Therefore, the GAN [25] aims to increase the similarity between generated and real images. We consider an encoder–decoder model [26] for the generator to extract image contours. Then, we employ a discriminator network to distinguish between the ground truth and the extracted image contours, thus improving the edge detection performance. The proposed method extracts edges and uses the foreground texture to tune the weight of the GAN loss for improved accuracy. The main contributions of this study can be summarized as follows:

• We developed a GAN-based method to extract image contours using foreground texture rather than noise pixels appearing in the background, thus outperforming current methods.

• We evaluated the weight of the adversarial loss during training and found that an increasing weight leads to the omission of local details.

The remainder of this paper is organized as follows. Section 2 surveys different approaches to contour detection, emphasizing the use of deep convolutional GANs. Section 3 details the proposed method, called ContourGAN, including the generator and discriminator architecture. Sections 4–6 present the performance evaluation of the proposed method against similar methods, under varying parameters, and on the NYUD dataset. Finally, we draw conclusions in Section 7.
2. Related work Contour detection is essential for most applications of computer vision, and thus numerous investigations are available on this topic. The evolution of contour detection mainly consists of early pioneering methods, classification methods built on handcrafted features, and the recently developed deep learning methods. In this section, we present some representative works on edge detection followed by an overview of GANs [25].
Image contour detection: Prior to deep learning models, contour extraction mainly relied on the calculation of local image gradients of color and intensity. For instance, the Sobel operator [27] is a discrete differentiation operator that computes an approximate gradient map of the image, from which edges are obtained by subsequent thresholding. The Canny edge detector [28], which builds on the Sobel operator, applies Gaussian smoothing to reduce noise, non-maximum suppression to thin the candidate edges, and finally double (hysteresis) thresholding to extract the image edges. Subsequently, researchers employed more sophisticated frameworks that discriminate edge from non-edge pixels by extracting image features such as intensity, gradient, and texture. Martin et al. [29] interpreted changes in brightness, color, and texture as Pb features and trained a classifier to determine whether a pixel belongs to an edge. Such handcrafted features promoted the development of edge feature extraction; however, a substantial gap remained between human recognition abilities and computational intelligence. In fact, because those data-driven methods focused only on the images, underlying information from the ground truth was neglected.

In recent years, CNNs have fostered developments in computer vision and have been widely applied to contour detection, achieving remarkable results. For instance, Ganin et al. [19] proposed the N4-Fields approach, which processes input images patch by patch to determine image edges: the input image passes through a CNN, and the resulting feature vectors are matched to a dictionary with known annotations. Hwang et al. [30] used DenseNet [31] to extract a feature vector per pixel for classification with a support vector machine. Shen et al. [20] derived a positive-sharing loss to learn discriminative features for contour detection and clustered positive pixels into subclasses fitted with different model parameters. An efficient and accurate edge detection framework was proposed by Xie et al. [22], in which raw images are transformed into the corresponding edge information and the feature maps of every pooling layer are used to generate nested contours; this technique demonstrated promising performance in terms of F-measures. Based on [22], Liu et al. [23] used the CNN features of all the convolutional layers for edge detection. Similarly, Yang et al. [21] developed an
FCED network with refined ground truth for contour detection. Unlike previous deep learning methods [19, 20, 32], which belong to the patch-based category, the approach in [21] employs an image-to-image framework for contour extraction.

GAN overview: CNNs rely on supervised learning, which requires abundant training images and ground-truth information to determine their relationship, and generating such training datasets usually demands considerable effort and material resources. Fortunately, the GAN [25, 33] is an effective unsupervised learning approach in which a generator constructs synthetic images from a Gaussian distribution and a discriminator distinguishes between real and synthetic images. Some GAN applications in image processing are summarized in the following. Denton et al. [34] proposed cascading GANs within a Laplacian pyramid framework to generate high-quality images. Ledig et al. [35] proposed an image-to-image framework that embeds a CNN architecture into a GAN model to generate high-resolution images for quality enhancement. Pan et al. [36] used a GAN to predict visual saliency in images, outperforming state-of-the-art methods. Inspired by [35, 36], we employ a GAN framework to refine the extracted image contours, as detailed below.
3. ContourGAN

In this section, we introduce the proposed method, ContourGAN, and detail its loss function. The method uses a network to automatically learn nonlinear mappings between raw images and their contours. To achieve this, we design a generator network to extract image contours and an adversarial model to refine the results.

3.1. Overview

ContourGAN is composed of a generator and a discriminator, with the former being an encoder–decoder model to extract edge information of
input images, and the latter a CNN that distinguishes the generated contours from the ground truth. As shown in Fig. 1, the encoder down-samples the input image by means of max-pooling layers, whereas the decoder up-samples the feature maps computed from the final layer of the encoder; the up-sampled maps are produced by transposed convolutional layers to guarantee consistency with the input size. Each convolutional stage in the encoder is connected to a corresponding convolutional layer in the decoder. In Fig. 1, we indicate only one connection per convolutional stage for illustrative purposes. For example, in the connection between conv1 and Up3, there are two convolutional layers in the first stage; two lines with arrows illustrate this connection, whereas a common connection is depicted using one line.

Figure 1: Architecture of the generator and discriminator networks in ContourGAN.

3.1.1. Generator

Encoder: As mentioned above, the generator comprises an encoder and a decoder. The encoder is a variant of the VGG16 model [13], pre-trained on the ImageNet dataset for object classification [37]. We implemented the following modifications to this model:

• The final pooling layer, pool5, and the fully connected layers are removed: the small size of pool5 is not suitable for capturing meaningful information compared with the other feature maps and can undermine edge localization, whereas the fully connected layers are unsuitable for extracting spatial information.

• The stride in pool4 is changed from 2 to 1, and atrous convolutional layers are employed to increase the receptive field in conv5.

• Each conv layer in the VGG16 model is connected to an additional conv layer with a kernel size of 3 × 3 and a channel depth of 32 or 16. The resulting conv layers in each stage are concatenated along the channel dimension with the up-sampling layers in the decoder.

Decoder: The decoder, which contains four stages, rescales the feature maps created in the encoder to the original image size. First, the six convolutional layers of conv6 and conv7, each of size M × N × D, are combined, and the outcome is up-sampled by a factor of two to obtain feature map Up1 with a size of 2M × 2N × D. Next, conv8 is concatenated with Up1, and conv9 with Up2; each concatenation is again up-sampled by a factor of two, yielding Up2 and Up3. Finally, Up3 is concatenated with conv10, which is followed by a 1 × 1 × 1 conv layer to generate the image contours. Every conv layer is activated by a leaky rectified linear unit, except for the last layer, which is activated by a sigmoid function. The up-sampling operations are implemented by transposed convolutional layers whose kernels are initialized using bilinear interpolation and kept fixed during training.
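To make the architecture concrete, the following Python (TensorFlow/Keras) sketch builds a simplified generator under the stated design. It is illustrative only: the layer names (block1_conv2, ..., block5_conv3) are those of tf.keras.applications.VGG16, the atrous convolutions in conv5, the pool4 stride change, and the bilinear initialization of the transposed convolutions are omitted for brevity, and the exact decoder channel counts are assumptions; only the overall encoder–decoder structure with per-stage skip connections follows the text.

import tensorflow as tf
from tensorflow.keras import layers

def build_generator(input_shape=(400, 400, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    # Encoder: VGG16 convolutional backbone pre-trained on ImageNet [37].
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                      input_tensor=inputs)
    # One skip tensor per encoder stage, each reduced by an extra 3x3 convolution
    # with a channel depth of 16 or 32 before concatenation in the decoder.
    skip_names = ["block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3"]
    skip_depths = [16, 16, 32, 32]
    skips = [layers.Conv2D(d, 3, padding="same", activation=tf.nn.leaky_relu)(
                 vgg.get_layer(n).output)
             for n, d in zip(skip_names, skip_depths)]
    x = vgg.get_layer("block5_conv3").output   # deepest features (pool5 removed)
    # Decoder: repeatedly up-sample by 2 and concatenate with the matching skip.
    for skip in reversed(skips):
        x = layers.Conv2DTranspose(32, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(32, 3, padding="same", activation=tf.nn.leaky_relu)(x)
    # A 1x1 convolution with a sigmoid produces the single-channel contour map.
    contours = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, contours, name="generator")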
3.1.2. Discriminator

The discriminator is a classification network that distinguishes the predicted contours from the ground truth. Its architecture is illustrated in Fig. 1; it consists of nine convolutional layers with 3 × 3 kernels, whose channel depth is progressively doubled from 64 to 512, as in the VGG16 model [13]. Each conv layer is followed by a batch normalization layer and activated by a leaky rectified linear unit. The final conv layer is followed by two fully connected layers, and a last layer activated by the sigmoid function outputs the probability used for classification.
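A corresponding sketch of the discriminator is given below. The nine 3 × 3 convolutions with channels doubling from 64 to 512, batch normalization, leaky ReLU activations, and the fully connected head follow the description above; the down-sampling schedule and the 1024- and 100-unit fully connected sizes (read off Fig. 1) are assumptions, not details stated in the text.

import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(input_shape=(400, 400, 1)):
    inputs = tf.keras.Input(shape=input_shape)   # a contour map, real or generated
    x = inputs
    channels = [64, 64, 128, 128, 256, 256, 512, 512, 512]
    for i, c in enumerate(channels):
        # Down-sample on every second layer (the exact schedule is an assumption).
        x = layers.Conv2D(c, 3, strides=2 if i % 2 == 0 else 1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dense(100)(x)
    x = layers.LeakyReLU()(x)
    # Single sigmoid unit: probability that the input is a ground-truth contour map.
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs, name="discriminator")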
3.2. Loss function

The proposed method for contour detection is formulated as follows. Given input images and their ground truths, the training data are represented as {(In, Cn), n = 1, 2, ..., N}, where In denotes a raw input image and Cn ∈ {0, 1} the corresponding ground truth, with 1 and 0 indicating edge and non-edge pixels, respectively. For brevity, we drop the subscript n in the following descriptions. The discriminator network DθD is optimized in an alternating manner along with the generator GθG to solve the adversarial minimax problem in Eqn. (1), where C and C̃ = GθG(I) represent the ground truth and the generated synthetic image, respectively, and I is the input of the generator. The generative model GθG aims to deceive the differentiable discriminator DθD, which is trained to distinguish the generated edges from the ground truth:

\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{C \sim p_{\mathrm{train}}(C)}\big[\log D_{\theta_D}(C)\big] + \mathbb{E}_{I \sim p_{G}(I)}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I))\big)\big]   (1)

3.2.1. Content loss

The content loss is a pixelwise value computed between the predicted edges and the ground truth. Unlike traditional classification, edge classification is highly imbalanced in terms of the number of edge pixels, and hence appropriate weights are applied to positive and negative pixels. In Eqn. (2), γ and β are the weights applied to the edge and non-edge terms, respectively, with γ = |Y−|/|Y| and β = |Y+|/|Y|, where |Y−| and |Y+| are the numbers of non-edge and edge pixels. We consider this class-balanced classification loss as the content loss, implemented using binary
cross-entropy, as expressed in Eqn. (2):

L_{BCE}(C, \tilde{C}) = -\frac{1}{N} \sum_{j=1}^{N} \left[ \gamma\, C_j \log \tilde{C}_j + \beta\, (1 - C_j) \log\big(1 - \tilde{C}_j\big) \right]   (2)
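A minimal Python sketch of this class-balanced content loss is given below. It is a sketch rather than the authors' code: C is assumed to hold binary ground-truth maps (1 marking edge pixels) and C_hat the sigmoid outputs of the generator, both of shape (batch, H, W, 1), and the mean over all pixels stands in for the 1/N normalization of Eqn. (2).

import tensorflow as tf

def content_loss(C, C_hat, eps=1e-7):
    # C: ground-truth maps (1 = edge, 0 = non-edge); C_hat: sigmoid outputs.
    num_pos = tf.reduce_sum(C)            # |Y+|, edge pixels
    num_neg = tf.reduce_sum(1.0 - C)      # |Y-|, non-edge pixels
    total = num_pos + num_neg             # |Y|
    gamma = num_neg / total               # weight on the (rare) edge term
    beta = num_pos / total                # weight on the non-edge term
    C_hat = tf.clip_by_value(C_hat, eps, 1.0 - eps)
    loss = -(gamma * C * tf.math.log(C_hat)
             + beta * (1.0 - C) * tf.math.log(1.0 - C_hat))
    return tf.reduce_mean(loss)           # mean over pixels approximates (1/N) * sum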
3.2.2. Adversarial loss

In addition to the content loss in Eqn. (2), we consider the adversarial loss given by Eqn. (3), which estimates the probability that a predicted edge map resembles the available edge information. Specifically, DθD(GθG(I)) is the estimated probability that the generated edge map is a ground truth, and hence the adversarial loss increases when the discriminator has a strong distinguishing ability. Eqn. (4) defines the loss function of ContourGAN, which combines the content and adversarial losses to improve the detection and classification performance; α denotes the weight of the adversarial loss. During training, the generator weights are pre-trained using the content loss, and then the discriminator is added for adversarial training. The generator and discriminator are updated alternately at every training iteration, and ℓ2 regularization is applied to the weights.
L^{G}_{adv} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(I)\big)   (3)

L_{GAN} = \underbrace{L_{BCE}}_{\text{content loss}} + \underbrace{\alpha L^{G}_{adv}}_{\text{adversarial loss}} + \underbrace{\lambda \lVert \theta \rVert_{2}}_{\text{regularization}}   (4)
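The alternating update described above can be sketched as follows. The functions content_loss, build_generator, and build_discriminator refer to the earlier sketches; the use of tf.GradientTape, Adam optimizers, and the identification of λ with the weight decay reported in Section 4.1 are assumptions about the implementation rather than the authors' exact training code.

import tensorflow as tf

alpha, lam = 1e-4, 2e-5                       # weights of Eqn. (4), cf. Section 4.1
g_opt = tf.keras.optimizers.Adam(1e-5)
d_opt = tf.keras.optimizers.Adam(1e-5)
bce = tf.keras.losses.BinaryCrossentropy()

def train_step(images, contours, generator, discriminator):
    # Generator update: content loss + alpha * adversarial loss + l2 penalty.
    with tf.GradientTape() as tape:
        fake = generator(images, training=True)
        adv = -tf.reduce_mean(tf.math.log(discriminator(fake, training=True) + 1e-7))
        l2 = tf.add_n([tf.nn.l2_loss(w) for w in generator.trainable_variables])
        g_loss = content_loss(contours, fake) + alpha * adv + lam * l2
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    # Discriminator update: ground-truth maps labelled 1, generated maps labelled 0.
    with tf.GradientTape() as tape:
        real_p = discriminator(contours, training=True)
        fake_p = discriminator(generator(images, training=False), training=True)
        d_loss = bce(tf.ones_like(real_p), real_p) + bce(tf.zeros_like(fake_p), fake_p)
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
    return g_loss, d_loss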
3.3. Comparison with FCED

The proposed generator is similar to the FCED structure [21], which also utilizes an encoder–decoder model to produce edge information. However, four differences between the two structures arise. First, fully connected layers integrate the encoder and decoder in FCED, whereas ContourGAN omits such layers. Second, in ContourGAN, conv5_1, conv5_2, and conv5_3 are atrous convolutional layers and the stride in pool4 is 1 instead of 2 as in FCED. Third, up-sampling
in FCED is implemented by recording the locations of the maximum values and un-pooling through the pooling layers, whereas deconvolutional layers with bilinear kernels are employed in ContourGAN. Finally, all the convolutional layers in ContourGAN contribute to refining the predicted results, whereas only the feature maps of the pooling layers are used in FCED.
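The bilinear initialization of the deconvolutional (transposed-convolution) kernels mentioned above can be constructed with the standard FCN-style recipe sketched below; the exact kernel size and channel handling used by the authors are assumptions.

import numpy as np

def bilinear_kernel(kernel_size, channels):
    # Returns a (k, k, channels, channels) kernel that performs bilinear up-sampling;
    # each output channel is up-sampled independently from the matching input channel.
    factor = (kernel_size + 1) // 2
    center = factor - 1.0 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weights = np.zeros((kernel_size, kernel_size, channels, channels), dtype=np.float32)
    for c in range(channels):
        weights[:, :, c, c] = filt
    return weights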
4. Experiments

In this section, we detail the implementation of ContourGAN and evaluate its performance.

4.1. ContourGAN implementation

We used the TensorFlow deep learning framework to implement the proposed ContourGAN. The weights of the encoder were initialized from the VGG16 model [13], and the decoder weights were initialized using the He initialization [38]. Optimization during training relied on adaptive moment estimation (Adam). Specifically, the minibatch size was set to 8, weight α to 0.0001, the global learning rate to 1e-5, and the weight decay to 2e-5, and training was run for 40,000 iterations. Every image was rescaled to 400 × 400 to reduce GPU memory usage. Moreover, data augmentation extended the training set by a factor of 32 [22, 39], and local regions were randomly cropped to improve the evaluation performance. To refine the predicted contours, standard non-maximum suppression [10] was applied. The Berkeley Segmentation Dataset and Benchmarks 500 (BSDS500) [11] and the NYU depth (NYUD) dataset were used to evaluate the proposed method using three standard measures: the fixed contour threshold over the dataset (ODS), the per-image best threshold (OIS), and the average precision (AP). All the experiments were conducted on an NVIDIA GTX 1070 GPU.
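The data preparation described above can be sketched as follows. The realization of the factor-32 expansion as 16 rotation angles times 2 flips, the use of Pillow, and the policy of rescaling the 224 × 224 random crops (mentioned in the highlights) back to the 400 × 400 input size are assumptions about the pipeline, not details taken from the paper.

import random
from PIL import Image

def augment_pair(image_path, label_path, num_angles=16, size=400, crop=224):
    # Rotate at num_angles angles and flip each rotation (16 x 2 = 32 variants),
    # then take a random crop and rescale it to the fixed network input size.
    image, label = Image.open(image_path), Image.open(label_path)
    samples = []
    for k in range(num_angles):
        angle = 360.0 * k / num_angles
        for flip in (False, True):
            im, lb = image.rotate(angle, expand=True), label.rotate(angle, expand=True)
            if flip:
                im = im.transpose(Image.FLIP_LEFT_RIGHT)
                lb = lb.transpose(Image.FLIP_LEFT_RIGHT)
            x = random.randint(0, max(im.width - crop, 0))
            y = random.randint(0, max(im.height - crop, 0))
            box = (x, y, x + crop, y + crop)
            samples.append((im.crop(box).resize((size, size)),
                            lb.crop(box).resize((size, size), Image.NEAREST)))
    return samples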
4.2. Evaluation on BSDS500

Figure 2: Evaluation results on BSDS500. The proposed ContourGAN achieves the highest performance (ODS = 0.802). Precision–recall F-measures: ContourGAN-MS .810, ContourGAN .802, generator .797, RCF .793, HFL .767, DeepContour .757, DeepEdge .753, OEF .746, MCG .744, SE .743, NCut .634, EGB .614, Canny .611, MShift .598.

We first compared the proposed ContourGAN with other methods on the BSDS500, which is composed of 200 training, 200 testing, and 100 validation images. Each image has between five and six manually annotated labels representing the
ground truth. If three annotators indicate that a pixel should be regarded as an edge, it is classified as a positive sample; if a pixel is identified as a negative sample by all annotators, it is classified as a negative sample. Pixels meeting neither condition are disregarded during the weight update. We also append the PASCAL VOC Context dataset [40] to the BSDS500 for training. For evaluation, the contour results of the test images were obtained by the generator, and standard non-maximum suppression [10] was subsequently applied to refine the predicted contours.

Table 1: Results on BSDS500 [11]; † indicates GPU time.

method             ODS     OIS     FPS
Canny [41]         .611    .676    16
N4-Fields [19]     .753    .769    1/6
DeepEdge [18]      .753    .772    1/1000†
DeepContour [20]   .756    .773    1/29†
DeepNet [46]       .738    .759    -
HFL [45]           .767    .788    -
RCF [23]           .793    .812    29†
Generator          .797    .828    18†
ContourGAN         .802    .831    18†

The comparison was performed against
non-deep-learning algorithms, namely the Canny edge detector [41], EGB [42], MShift [43], NCut [43], SE [10], and OEF [44], and various CNN-based methods, namely HFL [45], RCF [23], DeepContour [20], DeepEdge [18], N4-Fields [19], and DeepNet [46]. Fig. 2 shows the evaluation results. Compared with RCF [23], the performance improves from ODS = 0.793 to ODS = 0.797 when only the generator of the proposed method is used. This improvement might be due to the gradual doubling of the feature-map resolution, from 50 × 50 up to 400 × 400, whereas RCF directly rescales the feature maps of the different conv layers to 400 × 400 and thus disregards some features. Moreover, ContourGAN uses all the convolutional layers, which contain hierarchical information, to ensure high performance. When adversarial training is used, the evaluation result further improves from ODS = 0.797 to ODS = 0.802, and local background noise pixels are ignored. To further improve the quality of edge extraction, a multiscale algorithm [23] was used during validation, yielding a performance improvement of 0.8% over single-scale inference.

The statistical results reported in Table 1 show that CNN-based methods outperform conventional methods.

Figure 3: Feature maps at intermediate decoder layers. The first row shows outcomes obtained with the adversarial loss and the second row outcomes obtained with the content loss. From left to right, the five images represent the original image, the ground truth, and feature maps Up1, Up2, and Up3.

Compared with RCF [23], the ODS and OIS
F-measures of ContourGAN are 0.9% and 1.9% higher, respectively, because the GAN considers global object boundaries rather than local textures. Thus, the proposed method effectively disregards noise pixels and produces high-quality image contours. However, the FPS of ContourGAN is lower than that of RCF, owing to the symmetric structure of the encoder–decoder model and the additional convolutional filters: 32 filters are required in conv6, conv7, and conv8, whereas only 21 are required in RCF, and more conv layers are needed to reconstruct the feature maps to the original image scale.

To investigate the difference between the content and adversarial losses in the edge maps, the feature maps at each layer of the decoder were examined. A feature map is represented by a three-dimensional tensor I ∈ R^{W×H×D}, where W and H denote the width and height, respectively, and D denotes the channel depth. To simplify visualization, cross-dimensional weighting [47] was applied to aggregate the feature maps into two-dimensional maps I ∈ R^{W×H}. Fig. 3 shows the real images and ground truths in the first columns; the extracted image contours and feature maps obtained with the adversarial and content losses are shown in the first and second rows, respectively. The pixels highlighted by red rectangles are background noise pixels, which are suppressed by adversarial training, demonstrating the effectiveness of ContourGAN for edge detection.
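One simple way to collapse the channel dimension for such visualizations is sketched below, using sparsity-based channel weights in the spirit of cross-dimensional weighting [47]. This is an illustrative assumption; the paper does not spell out the exact aggregation used for Fig. 3.

import numpy as np

def aggregate_feature_map(feats, eps=1e-6):
    # feats: array of shape (W, H, D); returns a (W, H) map scaled to [0, 1].
    q = (feats > 0).mean(axis=(0, 1)) + eps    # per-channel response sparsity
    weights = np.log(q.sum() / q)              # rarely firing channels weigh more
    fmap = (feats * weights).sum(axis=-1)      # weighted sum over the channel axis
    fmap -= fmap.min()
    return fmap / (fmap.max() + eps)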
Figure 4: Results on BSDS500. Each instance shows the original image, ground truth, RCF [23], content loss, and ContourGAN outcomes. The red rectangle outlines the local region exhibiting high variability.
In Fig. 4, we show four raw images from BSDS500, followed in the subsequent columns by the ground truth, the RCF [23] edge map, the content-loss-based edge map, and the ContourGAN outcome. The examples in the top row present simple textures compared with those in the third and fourth rows. The red rectangles delimit image regions exhibiting high variability. Background noise pixels are ignored, i.e., the "grass" and "stone" pixels are not included in the edge maps. In the other images, the results extracted by ContourGAN are superior to those generated by RCF [23] in some local regions, i.e., the noise pixels located in the "photo frame" are ignored and the outline of the "house" is clearer. These results confirm that ContourGAN produces high-quality and clear edge maps for contour detection.
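The multiscale testing scheme borrowed from RCF [23] and used above can be sketched as follows: the image is resized to several scales, the generator is run on each, and the predictions are resized back to the original resolution and averaged. The sketch assumes a fully convolutional generator that accepts variable input sizes, and the scale set {0.5, 1.0, 1.5} is an assumption.

import numpy as np
import tensorflow as tf

def multiscale_contours(generator, image, scales=(0.5, 1.0, 1.5)):
    # image: float32 array of shape (H, W, 3); per-scale predictions are fused
    # by simple averaging at the original resolution.
    h, w = image.shape[:2]
    fused = np.zeros((h, w), dtype=np.float32)
    for s in scales:
        resized = tf.image.resize(image, (int(h * s), int(w * s)))
        pred = generator(resized[tf.newaxis, ...], training=False)[0]
        fused += tf.image.resize(pred, (h, w)).numpy()[..., 0]
    return fused / len(scales)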
5. ContourGAN parameter evaluation

Hyperparameter α in Eqn. (4) affects the behavior of the proposed ContourGAN, and hence different α values were evaluated. Specifically, four α values
were used to investigate the influence of the adversarial loss relative to the content loss. Fig. 5 shows four BSDS500 instances processed with different α values. Hyperparameter α clearly influences the edge-map extraction: lower values yield higher-quality contours. This occurs because the GAN forces the generated samples toward the distribution of the real edge maps, in which background pixels are largely absent. Furthermore, although the visual improvements provided by ContourGAN are subtle, noise pixels are removed from the edge maps when adversarial training is used, which makes the data distribution of the predicted results similar to that of the ground truth.
Figure 5: Edge maps and their ODS values for different α values: α = 0.1 (ODS = 0.775), α = 0.01 (ODS = 0.782), α = 0.001 (ODS = 0.795), and α = 0.0001 (ODS = 0.802).

We also evaluated the size of the transition convolutional filters between the encoder and decoder. These filters connect the feature maps generated by the encoder with their counterparts in the decoder. RCF [23] uses 1 × 1 convolutional filters to implement a dimension reduction, whereas the VGG16 model [13] uses 3 × 3 filters to enhance the discriminability of local patches. To assess the effect of the filter size, the 3 × 3 filters used in the previous experiments were replaced by 1 × 1 filters. Fig. 6 shows that local details from background noise pixels are retained by the 1 × 1 filters. Hence, these filters may not be appropriate for edge extraction, which
requires information among adjacent pixels to determine edges, rather than just the dimension integration among different channels. Finally, we evaluated the difference between the multilevel and encoder– decoder frameworks. The multilevel framework was introduced in HED [22] and directly up-samples feature maps at every convolutional stage to the original image size. Likewise, the encoder–decoder framework can be used to generate images by up-sampling. Fig. 7 shows the generated results of multilevel RCF [23] and the encoder–decoder framework. The local texture predicted using the encoder–decoder is clearer than that using RCF [23], thus confirming that the former outperforms the multi-level framework.
Figure 6: Illustration of edge maps with different filter sizes of 3 × 3 and 1 × 1.

Figure 7: Image contours extracted by multilevel RCF and the encoder–decoder framework.
6. Evaluation on NYUD dataset

In this section, we present the evaluation of ContourGAN on the NYUD dataset, which comprises video sequences of indoor scenes recorded with RGB–depth cameras. Unlike BSDS500, the NYUD dataset contains densely labeled pairs, and it has recently been widely used for edge extraction. For instance, Gupta et al. [48] split the dataset into 795 training and 654 testing images to evaluate an edge extraction method. Both HED and RCF employ depth information through the HHA representation, which encodes depth using three channels. However, as depth information rarely conveys useful contour information, we only train the model
using RGB images. In addition, we rotate the images and ground truths to four different angles and flip them at each angle for training. The other settings are the same as those used for the BSDS500. For testing, we cropped the images to 640 × 420 pixels to fit the inputs of the network. For evaluation, we increased the localization tolerance from 0.0075 to 0.011, as NYUD images are larger than those in the BSDS500.

Figure 8: Edge maps with different α values.
Figure 9: Results on NYUD (precision–recall curves): ContourGAN (F = .715), RCF (F = .712), SE+NG+ (F = .706), SE (F = .695), gPb+NG (F = .687), OEF (F = .651), gPb-UCM (F = .631).

Table 10: Results on NYUD [48].

method           ODS     OIS
OEF [44]         .650    .667
gPb-UCM [11]     .631    .661
gPb+NG [48]      .687    .716
SE [10]          .685    .699
SE+NG+ [49]      .710    .723
RCF [23]         .712    .722
ContourGAN       .715    .731
We compared the proposed method with some state-of-the-art methods. Fig. 9 shows the resulting precision–recall curves. The proposed ContourGAN achieves the best performance among the compared methods on the NYUD dataset. In addition, extracted edges are shown in Fig. 8, and edge maps are
compared with those of RCF for different α values of 1, 0.001, and 0.0001. By adjusting the α value, most of the noise pixels are no longer considered as edges. However, the best performance on the NYUD dataset is achieved with an α value smaller than that used for the BSDS500, because the textures of NYUD images are denser than those of BSDS500 images. Table 10 shows a statistical comparison among the methods, in which the proposed ContourGAN achieves the best results.

7. Conclusion

In this paper, we propose an end-to-end method for image contour detection, called ContourGAN, which relies on a GAN to refine the predicted results. In addition, all the convolutional layers in the encoding stage are used to predict image contours. As a result, ContourGAN retrieves higher-quality contours than other state-of-the-art contour extraction methods. However, the predicted edges in some complex local regions still exhibit a comparatively low quality. In future work, we will consider these complex local regions more thoroughly. The source code of ContourGAN is available at https://github.com/sxdxedu/ContourGAN.

Acknowledgments

The authors would like to thank Elsevier's English Language Editing service for their assistance. This work was supported by the National Natural Science Foundation of China (no. 61573229) and the Shanxi Provincial Science and Technology Department (no. 2015011048).

References

[1] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580–587. doi:10.1109/CVPR.2014.81.
[2] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Proceedings of the 13th European Conference on Computer Vision, Part III, Zurich, Switzerland, 2014, pp. 346–361. doi:10.1007/978-3-319-10578-9_23. [3] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the 29th Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 779–788. doi:10.1109/CVPR.2016.91. [4] H. Yang, K. Lin, C. Chen, Supervised learning of semantics-preserving hash via deep convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2) (2018) 437–451. doi:10.1109/TPAMI.2017.2666812. [5] K. Lin, H. Yang, J. Hsiao, C. Chen, Deep learning of binary hash codes for fast image retrieval, in: Proceedings of the 28th Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, 2015, pp. 27–35. doi:10.1109/CVPRW.2015.7301269. [6] P. Arbelaez, M. Maire, C. C. Fowlkes, J. Malik, From contours to regions: An empirical evaluation, in: Proceedings of the 22nd Conference on Computer Vision and Pattern Recognition, Miami, Florida, 2009, pp. 2294–2301. doi:10.1109/CVPRW.2009.5206707. [7] P. A. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marqués, J. Malik, Multiscale combinatorial grouping, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 328–335. doi:10.1109/CVPR.2014.49. [8] Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, Y. Zhao, S. Yan, STC: A simple to complex framework for weakly-supervised semantic segmentation, CoRR abs/1509.03150. [9] W. Wang, X. Yuan, Recent advances in image dehazing, IEEE/CAA Journal of Automatica Sinica 4 (3) (2017) 410–436.
[10] P. Doll´ar, C. L. Zitnick, Fast edge detection using structured forests, IEEE Trans. Pattern Anal. Mach. Intell. 37 (8) (2015) 1558–1570. doi:10.1109/ TPAMI.2014.2377715. [11] P. Arbelaez, M. Maire, C. C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 898–916. doi:10.1109/TPAMI.2010.161. [12] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, 2012, pp. 1106–1114. [13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the 28th Conference on Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 1–9. doi:10.1109/CVPR.2015. 7298594. [14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the 29th Conference on Computer Vision and Pattern Recognition,Las Vegas, NV, 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90. [15] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems 28, 2015, pp. 91–99. [16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: single shot multibox detector, in: Proceedings of the 14th European Conference on Computer Vision Part I, Amsterdam, Netherlands, 2016, pp. 21–37. doi:10.1007/978-3-319-46448-0\ 2. [17] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the 28th Conference on Com-
puter Vision and Pattern Recognition, Boston, MA, 2015, pp. 3431–3440. doi:10.1109/CVPR.2015.7298965. [18] G. Bertasius, J. Shi, L. Torresani, Deepedge: A multi-scale bifurcated deep network for top-down contour detection, in: Proceedings of the 28th Conference on Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 4380–4389. doi:10.1109/CVPR.2015.7299067. [19] Y. Ganin, V. S. Lempitsky, N^4-fields: Neural network nearest neighbor fields for image transforms, in: Proceedings of the 12th Asian Conference on Computer Vision, Part II, Singapore, Singapore, 2014, pp. 536–551. doi:10.1007/978-3-319-16808-1_36. [20] W. Shen, X. Wang, Y. Wang, X. Bai, Z. Zhang, Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection, in: Proceedings of the 28th Conference on Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 3982–3991. doi:10.1109/CVPR.2015.7299024. [21] J. Yang, B. L. Price, S. Cohen, H. Lee, M. Yang, Object contour detection with a fully convolutional encoder-decoder network, in: Proceedings of the 29th Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 193–202. doi:10.1109/CVPR.2016.28. [22] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of the 14th International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1395–1403. doi:10.1109/ICCV.2015.164. [23] Y. Liu, M. Cheng, X. Hu, K. Wang, X. Bai, Richer convolutional features for edge detection, in: Proceedings of the 30th Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 5872–5881. doi:10.1109/CVPR.2017.622. [24] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the 13th European Conference on Computer
Vision, Part I, Zurich, Switzerland, 2014, pp. 818–833. doi:10.1007/978-3-319-10590-1_53. [25] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, CoRR abs/1511.06434. arXiv:1511.06434. [26] Z. Wang, L. Wang, S. Liu, G. Wei, Encoding-decoding-based control and filtering of networked systems: insights, developments and opportunities, IEEE/CAA Journal of Automatica Sinica 5 (1) (2018) 3–18. [27] I. E. Sobel, Camera models and machine perception, Ph.D. thesis, Stanford, CA, USA, aAI7102831 (1970). [28] J. F. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6) (1986) 679–698. doi:10.1109/TPAMI.1986.4767851. [29] D. R. Martin, C. C. Fowlkes, J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, IEEE Trans. Pattern Anal. Mach. Intell. 26 (5) (2004) 530–549. doi:10.1109/TPAMI.2004.1273918. [30] J. Hwang, T. Liu, Pixel-wise deep learning for contour detection, CoRR abs/1504.01989. arXiv:1504.01989. [31] G. Huang, Z. Liu, K. Q. Weinberger, Densely connected convolutional networks, CoRR abs/1608.06993. [32] F. N. Iandola, M. W. Moskewicz, S. Karayev, R. B. Girshick, T. Darrell, K. Keutzer, Densenet: Implementing efficient convnet descriptor pyramids, CoRR abs/1404.1869. [33] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems 27, 2014, pp. 2672–2680.
[34] E. L. Denton, S. Chintala, R. Fergus, et al., Deep generative image models using a laplacian pyramid of adversarial networks, in: Advances in neural information processing Systems 28, 2015, pp. 1486–1494. [35] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the 30th Conference on Computer Vision and Pattern Recognition,Honolulu, HI, 2017, pp. 105–114. doi:10.1109/CVPR.2017.19. [36] J. Pan, C. Canton-Ferrer, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol, X. Gir´o i Nieto, Salgan: Visual saliency prediction with generative adversarial networks, CoRR abs/1701.01081. arXiv:1701.01081. [37] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Imagenet: A largescale hierarchical image database, in: Proceedings of the 22th Conference on Computer Vision and Pattern Recognition, Miami, Florida, 2009, pp. 248–255. doi:10.1109/CVPRW.2009.5206848. [38] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 1026–1034. doi:10.1109/iccv.2015.123. [39] Y. Liu, M. S. Lew, Learning relaxed deep supervision for better edge detection, in: Proceedings of the 29th Conference on Computer Vision and Pattern Recognition,Las Vegas, NV, 2016, pp. 231–240. doi:10.1109/ CVPR.2016.32. [40] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, A. L. Yuille, The role of context for object detection and semantic segmentation in the wild, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition,Columbus, OH, 2014, pp. 891–898. doi:10.1109/ CVPR.2014.119.
[41] J. F. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6) (1986) 679–698. doi:10.1109/TPAMI.1986.4767851. URL https://doi.org/10.1109/TPAMI.1986.4767851 [42] P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision 59 (2) (2004) 167–181. doi:10.1023/B:VISI.0000022288.19776.77. [43] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619. doi:10.1109/34.1000236. [44] S. Hallman, C. C. Fowlkes, Oriented edge forests for boundary detection, in: Proceedings of the 27th Conference on Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 1732–1740. doi:10.1109/CVPR.2015.7298782. [45] G. Bertasius, J. Shi, L. Torresani, High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision, in: Proceedings of the 14th International Conference on Computer Vision, Santiago, Chile, 2015, pp. 504–512. doi:10.1109/ICCV.2015.65. [46] J. J. Kivinen, C. K. I. Williams, N. Heess, Visual boundary prediction: A deep neural prediction network and quality dissection, in: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 2014, pp. 512–521. [47] Y. Kalantidis, C. Mellina, S. Osindero, Cross-dimensional weighting for aggregated deep convolutional features, in: Proceedings of the 14th European Conference on Computer Vision Workshops, Part I, Amsterdam, Netherlands, 2016, pp. 685–701. doi:10.1007/978-3-319-46604-0_48.
[48] S. Gupta, P. Arbelaez, J. Malik, Perceptual organization and recognition of indoor scenes from RGB-D images, in: Proceedings of the 26th Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 564–571. doi:10.1109/CVPR.2013.79. [49] S. Gupta, R. B. Girshick, P. A. Arbeláez, J. Malik, Learning rich features from RGB-D images for object detection and segmentation, in: Proceedings of the 13th European Conference on Computer Vision, Part VII, Zurich, Switzerland, 2014, pp. 345–360. doi:10.1007/978-3-319-10584-0_23.