Saliency-guided level set model for automatic object segmentation


Pattern Recognition 93 (2019) 147–163 Contents lists available at ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/patcog...



Qing Cai a,b,∗, Huiying Liu a, Yiming Qian b, Sanping Zhou c, Xiaojun Duan d, Yee-Hong Yang b

a School of Automation, Northwestern Polytechnical University, Youyi West Road No. 127, Shanxi, Xi’an, 710072 China
b Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8 Canada
c The Institute of Artificial Intelligence and Robotic, Xi’an Jiaotong University, Xianning West Road No. 28, Shanxi, Xi’an, 710049 China
d National Key Laboratory of UVA Technology, Northwestern Polytechnical University, Youyi West Road No. 127, Shanxi, Xi’an, 710072 China

Article info

Article history: Received 18 May 2018; Revised 22 March 2019; Accepted 23 April 2019; Available online 24 April 2019

Keywords: Level set model; Object segmentation; Visual saliency; Graph cuts; Automatic initialization

Abstract

The level set model is a popular method for object segmentation. However, most existing level set models perform poorly on color images because they use only grayscale intensity information to define their energy functions. To address this shortcoming, we propose a new saliency-guided level set model (SLSM), which automatically segments objects in color images under the guidance of visual saliency. Specifically, we first define a global saliency-guided energy term to extract the color objects approximately. Then, by integrating information from different color channels, we define a novel local multichannel-based energy term to extract the color objects in detail. In addition, instead of the length regularization term used in conventional level set models, we achieve segmentation smoothness by incorporating our SLSM into a graph cuts formulation. More importantly, the proposed SLSM is automatically initialized by saliency detection. Finally, evaluation on public benchmark databases and on our collected database demonstrates that the new SLSM consistently outperforms many state-of-the-art level set models and saliency detection methods in accuracy and robustness.

© 2019 Published by Elsevier Ltd.

∗ Corresponding author at: School of Automation, Youyi West Road No. 127, Xi’an, 710129 China. E-mail address: [email protected] (Q. Cai).

https://doi.org/10.1016/j.patcog.2019.04.019 0031-3203/© 2019 Published by Elsevier Ltd.

1. Introduction

Object segmentation, the process of partitioning an image into foreground and background, plays a significant role in computer vision, especially for color images. As the cost of color sensors falls, color cameras now range from high-end SLR digital cameras to low-cost smartphones, webcams, dashcams and body cams. Furthermore, as a preprocessing step, the quality of color object segmentation significantly affects many computer vision applications, for example, object retrieval [1], medical diagnosis [2] and video cutout [3]. For instance, color image segmentation can help to locate pathological tissue accurately for surgery and thus improve the success rate of an operation. Most importantly, two regions with different colors but equal grey intensity cannot be separated in grey scale because there is no intensity difference between them. Using color easily alleviates this problem. Hence, the ability to process color images is an important requirement for many image processing applications.

The level set model proposed by Osher and Sethian [4] is widely applied in object segmentation since it can handle complex topology of a region using an implicit energy function in a high dimension to represent the region’s contour. Currently, many level set models (LSMs) have been proposed and they are broadly divided into two categories: the edge-based LSMs and the region-based LSMs. As their names imply, the edge-based LSMs [5–7] mainly use image boundary information, such as gradient, to derive the equation of the evolving curve. As a result, they are more suitable for objects with strong boundaries. For example, the distance regularized level set evolution model [6] proposed by Li et al. utilizes an edge indicator function constructed from image gradients to guide curve evolution. However, models of this kind may fail to segment objects with weak edges. Different from the edge-based LSMs, the region-based LSMs [8–10] construct the energy function using region information, such as texture and intensity. Thus, they can detect weak edges that cannot be detected using edge-based LSMs. Representatives of this category include the classic Mumford-Shah [8] model and the Chan-Vese (CV) [9] model. However, they fail to segment objects in intensity inhomogeneous images and color images. To address the issue of intensity inhomogeneity, many LSMs [2,11–20] have been proposed. The representative models include the local binary fitting (LBF) model [11], the local and global intensity information (LaG_II) model [13], the global and local region active contour (GARAC) [14], the statistical level set approach (SLSA) [15],


the local hybrid image fitting (LHIF) [16] and the correntropy-based level set method (CLSM) [2]. In particular, the SLSA model proposed by Zhang et al. [15] describes inhomogeneous objects as a mixture of Gaussian distributions with different means and variances, and defines a sliding window to map the original image to another intensity domain that is easier to segment. The CLSM model proposed by Zhou et al. [2] defines a bias-field-corrected term and incorporates it into an energy function based on the correntropy criterion, which achieves object segmentation and bias field estimation simultaneously. In short, the methods above address object segmentation under intensity inhomogeneity in different ways. Unfortunately, most of them perform poorly in segmenting color objects because they use only grayscale information to define their energy functions. To further address the issue, Dai et al. [21] propose an inhomogeneity-embedded active contour model (InH_ACM) for natural image segmentation and achieve satisfactory segmentation results. The InH_ACM uses a pixel inhomogeneity factor proposed in [22], which is designed for segmenting images with textured objects, to construct a local energy function under the CV framework. Thus, it is more suitable for textured object segmentation. However, defining the global energy term from the intensity information of color images alone may not be suitable. Although tremendous improvement has been made in object segmentation using LSMs, segmenting objects in color images remains an issue since most LSMs use only intensity to define their energy functions.

Recently, inspired by the human ability to quickly and efficiently identify important information in complex surroundings, visual saliency has been introduced into computer vision for object detection, object recognition, image retrieval and visual tracking. As a result, many saliency detection methods (SDMs) have been proposed, which can be roughly categorized into fixation-level SDMs [23], object-level SDMs [24–30], objectness proposal SDMs [31] and deep learning based SDMs [32–37]. Owing to the success of deep neural networks, the results of state-of-the-art salient object detection methods have improved significantly. Nevertheless, it is still difficult for existing SDMs to accurately extract object boundaries from the saliency map, because the saliency map is an object scoring map in which object boundaries are blurry [36].

Motivated by the issues discussed above, we propose a saliency-guided level set model for object segmentation. It avoids the difficulty that LSMs have with color images and the poor capability of SDMs to locate real object boundaries from saliency maps, while retaining the ability of LSMs to detect weak object boundaries and the ability of SDMs to detect salient objects in color images. First, we define a global saliency-guided energy function based on the CV framework to roughly extract the objects. Then, we define a local multichannel based level set model using the CIE L∗a∗b∗ color space to extract the boundaries of objects. To avoid creating small isolated regions in the final segmentation, we propose a new graph cuts based method which combines the Heaviside function of the level set model with the data term instead of the conventional length regularization term.
More importantly, we propose a new automatic initialization method that uses graph cuts to segment the saliency map, which avoids the tedious manual initialization process. Finally, because no suitable datasets have been designed for LSMs, we construct a database of 2000 images, all collected from public benchmarks for salient object detection, to promote the study of object segmentation using LSMs. Extensive experiments on public benchmark datasets and our dataset validate that the proposed SLSM outperforms most state-of-the-art LSMs for color image object segmentation by a wide margin, and

realizes performance comparable or even superior to that of many deep learning based SDMs. In summary, the main contributions of this paper include:

• A global saliency-guided energy term is defined using the saliency map to roughly extract the objects, which significantly improves the segmentation efficiency and the robustness of the SLSM to noise and to initialization.
• Unlike most existing level set models, which use only grayscale information of color images to define the energy function, the proposed local multichannel-based energy term uses the CIE L∗a∗b∗ color space and achieves improved color image segmentation results.
• A novel graph cuts based method is proposed that uses the Heaviside function of the level set model to define the data term, which avoids small isolated regions in the final segmentation. As a result, it improves the segmentation accuracy.
• A new automatic initialization method that uses graph cuts to segment the image saliency map avoids the tedious and time-consuming manual initialization and further improves the segmentation efficiency.

The rest of the paper is organized as follows. In Section 2, we briefly review the classical level set models, the graph cuts method and saliency detection methods, and discuss their advantages and disadvantages. Our new method is discussed in Section 3. In Section 4, we evaluate our method via extensive experiments on public benchmarks and our database. Section 5 concludes the paper and gives future work.

2. Related work

2.1. Level set models

In [9], Chan and Vese propose the region-based CV model under the framework of Mumford-Shah [8], which successfully achieves object segmentation with weak boundaries that cannot be detected using edge-based LSMs, and opens a new direction for LSMs. The energy function of the CV model is defined as:

$$E^{CV} = \sum_{i=1}^{2} \lambda_i \int_{\Omega} |I(x, y) - c_i|^2 M_i(\phi)\,dx\,dy + \mu \int_{\Omega} |\nabla H(\phi)|\,dx\,dy + \nu \int_{\Omega} H(\phi)\,dx\,dy, \tag{1}$$

where λi (i = 1, 2 ), μ and ν are weights controlling the contributions of different terms. ci (i = 1, 2 ) are constants describing the image intensity inside and outside the evolution curve. M1 (φ ) = H (φ ) and M2 (φ ) = 1 − H (φ ) are membership functions, where H(z) represents the Heaviside function. I(x, y) denotes the pixel value at location (x, y) of image I. φ denotes the level set function. By using zero level set to represent object contours, the CV model can perform image segmentation when the above energy function is minimized. Since the CV model uses image intensity information rather than image gradient to define the energy function, it can accurately segment objects with weak edges and noise [12]. Nevertheless, because of using two constants to approximate the image intensity inside and outside the evolution curve and of using only image grayscale information to define the energy function, it has poor performance in segmenting color objects or objects with inhomogeneous intensity distribution [5,13,15,16]. To address this issue, Li et al. [11] propose the LBF model by embedding a local kernel function into the CV model. A two-phase level set formulation of the LBF model is defined as:


Fig. 1. Workflow of graph cuts.

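$$E^{LBF} = \sum_{i=1}^{2} \int_{\Omega}\int_{\Omega} K(y - x)\,|I(x) - f_i(x)|^2 M_i(\phi(x))\,dx\,dy + \mu \int_{\Omega} \frac{1}{2}\left(|\nabla\phi| - 1\right)^2 dx + \nu \int_{\Omega} |\nabla H(\phi)|\,dx, \tag{2}$$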

where K is the truncated Gaussian function to control the size of the local neighborhood. fi (i = 1, 2 ) are two functions that fit local image intensities near the point x inside or outside of the evolving curve. The second and the third terms are the conventional penalty term and the length regularization term, respectively, and μ, ν are their balancing factors. By introducing Gaussian function to extract image local intensity information, the LBF model can successfully segment images with slight intensity inhomogeneity. But the LBF model may fail to segment images when local intensity changes rapidly since it assumes that the intensity in a local small neighborhood is homogeneous. Besides, because of using only image grayscale information to define its energy function, the LBF model has poor performance in color object segmentation. To further address the above issue, Dai et al. [21] propose the inhomogeneity-embedded active contour model (InH_ACM) using a pixel inhomogeneity factor proposed in [22], and their energy function is defined as:

$$E^{InH\_ACM} = \sum_{i=1}^{2} \lambda_1 \int_{\Omega} |I(x, y) - c_i|^2 M_i(\phi)\,dx\,dy + \sum_{j=1}^{2} \lambda_2 \int_{\Omega} |PIF(x, y) - \chi_j|\,dx\,dy + \mu \int_{\Omega} |\nabla H(\phi)|\,dx\,dy, \tag{3}$$

where the first term is the data term of the CV model, and the third term is the length regularization term. $\chi_j$ ($j = 1, 2$) in the second term are the average values of pixel inhomogeneity. $PIF(x, y)$ is the pixel inhomogeneity factor, and is defined as:

$$PIF(p) = \frac{\left|\{q \in N(p) : |I(p) - I(q)| > \iota\}\right|}{|N(p)|}, \tag{4}$$

where $p$ is a pixel in image $I$, $N(p)$ denotes the pixels in the spatial neighborhood of $p$, and $\iota > 0$ is a given threshold. Although the InH_ACM can be applied to color object segmentation, it is applicable to color objects with textures but not to all kinds of color objects. Two main factors contribute to this problem: 1) The pixel inhomogeneity factor is designed for segmenting images with textured objects and not for all images; 2) The first term of the InH_ACM

uses only image intensity information to guide curve evolution. Furthermore, since no penalty term is used in the energy function, the user needs to manually re-initialize the level set function of InH_ACM to keep it close to a signed distance function during curve evolution, which seriously slows down the segmentation.

2.2. Graph cuts for segmentation

Graph cuts, a popular optimization framework proposed by Boykov et al. [38] for binary classification, has been a key method for solving numerous low-level computer vision problems, such as image segmentation [39] and video matting [3]. The main idea of graph cuts is to formulate the segmentation problem as a graph cut process, as shown in Fig. 1. Specifically, the original image with labeled seeds (foreground seed and background seed) (Fig. 1(a)) first needs to be represented as an undirected graph $G = \langle V, E \rangle$ (Fig. 1(b)), where V denotes a set of nodes (vertices) and E denotes a set of undirected edges with costs/weights connecting the nodes. As shown in Fig. 1(b), there are two types of nodes in V: terminal nodes called source, s, and sink, t, and non-terminal nodes corresponding to image pixels. Similarly, there are two types of edges in E: n-links, which connect two non-terminal nodes, and t-links, which connect a non-terminal node with a terminal node; all edges are assigned a nonnegative cost/weight. Then, by finding the cut with the minimum total edge cost, we complete the graph cut process (Fig. 1(c)). Finally, the segmentation result is obtained by transferring the graph back to its corresponding image. Considering that graph cuts can suppress noise and produce smooth segmentation results [39], in this paper, we replace the tedious manual initialization process with an automatic initialization method (see Section 3.4) that uses graph cuts to segment the saliency map, which improves the segmentation efficiency.
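The t-link/n-link construction can be illustrated on a toy one-dimensional "image". The sketch below builds t-links from a saliency value in the manner of the data term of Section 3.4 (cost of minus the log of the saliency for the foreground label, and of its complement for the background label) and solves the minimum cut with a plain Edmonds-Karp max-flow. The n-link weight `lam * exp(-(I_p - I_q)^2 / 2)`, the clamping constant `eps` and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
import math
from collections import deque


def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow. Returns (cut value, nodes on the source side)."""
    # Residual graph: copy forward capacities, add zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)
    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No path left: nodes still reachable form the source side.
            return flow, set(parent)
        # Walk back from the sink, find the bottleneck, then push flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck


def segment_1d(saliency, intensity, lam=0.5, eps=1e-6):
    """Label a row of pixels: True = foreground, False = background."""
    n = len(saliency)
    capacity = {"s": {}, "t": {}}
    for p in range(n):
        capacity[p] = {}
    for p, s in enumerate(saliency):
        s = min(max(s, eps), 1.0 - eps)         # keep the logarithms finite
        capacity["s"][p] = -math.log(1.0 - s)   # t-link cut if p is background
        capacity[p]["t"] = -math.log(s)         # t-link cut if p is foreground
    for p in range(n - 1):
        diff = intensity[p] - intensity[p + 1]
        w = lam * math.exp(-0.5 * diff * diff)  # assumed n-link weight
        capacity[p][p + 1] = w                  # undirected: both directions
        capacity[p + 1][p] = w
    cut_value, source_side = min_cut(capacity, "s", "t")
    labels = [p in source_side for p in range(n)]
    return labels, cut_value
```

For example, `segment_1d([0.9, 0.8, 0.2, 0.1], [1.0, 1.0, 0.0, 0.0])` labels the first two pixels as foreground: the only n-link that is cut is the one across the intensity step, where the smoothness weight is smallest.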
In addition, as an indispensable part of the energy function of level set models, the conventional length term smooths the evolving curve and avoids small isolated regions in the final segmentation result, but it performs poorly for color images. To address this issue, we use graph cuts with a newly defined data term in place of the conventional length term (see Section 3.4).

2.3. Saliency detection methods

Early saliency detection methods mainly concentrate on using low-level image features and cues to construct contrast and various prior knowledge to obtain the saliency map. Among them,


Fig. 2. Framework of the proposed SLSM. Firstly, the saliency detection model extracts the saliency and the SLSM realizes automatic initialization using graph cuts. Then, based on image saliency, the global saliency-guided (GS) energy term EGS roughly extracts the object. Concurrently, the proposed local multichannel-based (LM) term ELM encodes color image information. Finally, the smooth segmentation result can be obtained using graph cuts.

contrast is most widely used because researchers believe that the saliency between background and foreground is revealed in regions with high contrast [25,29]. The weighted contrast (WC) saliency detection method proposed by Zhu et al. [29] is a typical representative, which achieves satisfactory results by proposing a novel contrast measure: background weighted contrast. Recently, because of the success of convolutional neural networks (CNNs), features extracted using CNNs have been applied to saliency detection [32,35] with better performance than early hand-crafted saliency detection methods. For example, the deep hierarchical saliency network (DHSNet) proposed by Liu et al. [35] shows superior performance and real-time speed by defining a novel end-to-end deep hierarchical saliency network based on CNNs. In this paper, we mainly use two saliency detection methods, WC [29] and DHSNet [35], to guide our SLSM.

3. Methodology

In this section, our new model is presented. In particular, we first show the framework of the proposed SLSM in Fig. 2. Then, we give the detail of the energy function of the SLSM. Finally, we describe the process to minimize the defined energy function using the level set formulation. The overall energy function of the proposed model is defined as:

$$E^{SLSM} = (1 - \omega)E^{GS} + \omega E^{LM} + E^{P}, \tag{5}$$

where $E^{GS}$, $E^{LM}$ and $E^{P}$ are the global saliency-guided energy term, the local multichannel based energy term and the penalty term, respectively. $\omega$ is a positive constant to weigh the relative importance of the first two terms. The detailed definition of each term is described in the following subsections.

3.1. Global saliency-guided energy term

Since the CV model uses only two constants to globally fit the foreground and background intensities, it avoids the complex approximation process of image intensity and significantly improves the segmentation efficiency. Besides, the CV model is also robust to noise and initialization because it uses global image information to define the energy function. Considering the above advantages of the CV model, we construct our global saliency-guided (GS) energy term under the framework of the CV model, and it is defined by:

$$E^{GS} = \sum_{i=1}^{2} \lambda_i \int_{\Omega_i} |I_{S\_Map}(x) - \varsigma_i|^2\,dx\,dy, \tag{6}$$

where $\lambda_1$ and $\lambda_2$ are balancing factors controlling the weights of the two terms. $\cup_{i=1}^{2}\Omega_i$ is a partition of the image domain $\Omega$, where $\Omega_1 \cap \Omega_2 = \emptyset$ and $\Omega = \Omega_1 \cup \Omega_2$. $I_{S\_Map}(x)$ denotes the image saliency map, which is computed by WC [29] or DHSNet [35]. $\varsigma_1$ and $\varsigma_2$ are constants describing the average value of image saliency inside and outside the evolving curve, respectively. Based on the image saliency map, the global saliency-guided energy term can efficiently and robustly guide the evolving curve toward objects in color images, and the guidance stops when the curve is close to their boundaries.

3.2. Local multichannel-based energy term

To accurately locate boundaries of objects in color images, we define a local multichannel based (LM) energy term. Conventional level set models use only intensity information to define their energy function, which is not suitable for color images. Because of this limitation, we use three-channel information in the CIE L∗a∗b∗ color space to define the local data term as:

$$E^{LM} = \sum_{i=1}^{2} \lambda_i \int_{\Omega_i} \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x - y)\,|I_k(y) - \psi_k(i)|^2\,dx\,dy, \tag{7}$$

where $K_\sigma(x) = \beta e^{-|x|^2/(2\sigma^2)}$, $|x| \le \rho$, is a truncated Gaussian function with standard deviation $\sigma$, which is used to control the size of the local region, and $\rho$ is the local neighborhood radius (usually $\rho = 4\sigma + 1$, similar to [11]). $\beta$ denotes the normalization constant. $\psi_k(i)$, $i = 1, 2$, respectively, denote the average value inside and outside of the evolving curve in the $k$ color channel. For example, $\psi_{L^*}(1)$ denotes the average value of $L^*$ inside the evolving curve. Because of using the truncated Gaussian function and three color channels, the proposed local multichannel based energy term can extract the local information of color images, such as boundaries, texture and color features. As a result, it can accommodate weak boundaries of color objects.
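To make the role of the truncated Gaussian in Eq. (7) concrete, the sketch below builds the kernel on an integer grid and evaluates the kernel-weighted multichannel misfit at a single pixel for one region. It assumes that the normalization constant is chosen so that the discrete weights sum to one and that pixels outside the image are handled by replicating the border; neither choice is stated in the text, and the function names are illustrative.

```python
import math


def truncated_gaussian_kernel(sigma):
    """Weights for integer offsets (dx, dy) with distance <= rho = 4*sigma + 1."""
    rho = 4.0 * sigma + 1.0
    r = int(math.floor(rho))
    raw = {}
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            d2 = dx * dx + dy * dy
            if d2 <= rho * rho:  # truncation |x| <= rho
                raw[(dx, dy)] = math.exp(-d2 / (2.0 * sigma ** 2))
    beta = 1.0 / sum(raw.values())  # assumed normalization: weights sum to 1
    return {offset: beta * w for offset, w in raw.items()}


def local_multichannel_cost(channels, row, col, psi, kernel):
    """Kernel-weighted squared misfit summed over the colour channels.

    channels: list of 2-D lists, e.g. the L*, a*, b* planes of an image.
    psi:      one fitting value per channel for the region being tested.
    """
    height, width = len(channels[0]), len(channels[0][0])
    cost = 0.0
    for (dx, dy), w in kernel.items():
        r = min(max(row + dy, 0), height - 1)  # replicate the border (assumed)
        c = min(max(col + dx, 0), width - 1)
        for k, plane in enumerate(channels):
            cost += w * (plane[r][c] - psi[k]) ** 2
    return cost
```

On a constant image the cost is zero when `psi` matches the pixel values and equals the sum of squared per-channel offsets otherwise, because the weights sum to one.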

3.3. Penalty term

For stable evolution and accurate numerical computation, the level set function needs to be re-initialized as a signed distance function during evolution, which is tedious and time consuming. To address this issue, Li et al. [5] propose a penalty term, as mentioned in Section 2.1, that avoids the re-initialization process. The energy function of the penalty term is defined by:

$$E^{P}(\phi) = \int_{\Omega} p(|\nabla\phi|)\,dx, \tag{8}$$

where $p(s)$ is the potential function defined as:

$$p(s) = \frac{1}{2}(s - 1)^2. \tag{9}$$

The corresponding diffusion rate $d_p(s)$ is:

$$d_p(s) = \frac{p'(s)}{s} = 1 - \frac{1}{s}. \tag{10}$$

From the above equations, we can see that when $s \to 0$ the diffusion rate diverges ($d_p(s) \to -\infty$), which causes an unstable condition and affects the normal evolution of the level set curve [6]. So, in this paper, we define an improved penalty term using a piecewise polynomial to address this issue. The potential function of our penalty term is defined as:

$$p(s) = \begin{cases} \frac{1}{2}s^2(s - 1)^2, & \text{if } s \le 1, \\ \frac{1}{2}(s - 1)^2, & \text{if } s \ge 1. \end{cases} \tag{11}$$

The corresponding diffusion rate $d_p(s)$ is:

$$d_p(s) = \frac{p'(s)}{s} = \begin{cases} (s - 1)(2s - 1), & \text{if } s \le 1, \\ 1 - \frac{1}{s}, & \text{if } s \ge 1. \end{cases} \tag{12}$$

We observe that $d_p(s) \to 1$ rather than diverging when $s \to 0$, which avoids the unstable condition mentioned above. Besides, the improved penalty term can keep the level set function $\phi$ a signed distance function ($|\nabla\phi| = 1$), since the potential function $p(s)$ attains its minimum at $s = 1$.

3.4. Level-set based graph cuts

Most existing LSMs still use labor-intensive user initialization. With the development of artificial intelligence, minimal user interaction is a recent trend and level set models should be no exception. Thus, in this section, we propose an automatic initialization method that uses graph cuts to cut the saliency map. Besides, rather than using the length regularization term of most level set models, we use graph cuts to avoid the occurrence of small isolated regions in the final segmentation, which achieves better performance. Specifically, the cost function of our graph cuts is defined as:

$$E(f) = E_{data}(f) + \lambda E_{smooth}(f). \tag{13}$$

The first term is the data term representing the costs/weights of the t-link edges (see Fig. 1(b)), each of which connects a non-terminal node with a terminal node, and is defined as:

$$E_{data}(f) = \begin{cases} -\log \sum_{k=1}^{255} l_F(f), & \text{if } f_F = 1, \\ -\log \sum_{k=1}^{255} l_B(f), & \text{if } f_B = 1, \end{cases} \tag{14}$$

where $f$ is the labeled seeds (see Fig. 1(a)) and $f_F$ and $f_B$, respectively, denote the pixel being labeled as the foreground and background. Specifically, $f_F = 1$ denotes that the pixel is labeled as the foreground and $f_B = 1$ denotes that the pixel is labeled as the background. There is an implicit constraint $f_F = 1 - f_B$. $l_F$ and $l_B$ are the energy (cost) of labeling each pixel as foreground ($f_F = 1$) and background ($f_B = 1$), respectively. In the step of automatic segmentation, $l_F$ and $l_B$ are defined as:

$$l_F(f) = S_{Map}(f), \quad \text{and} \quad l_B(f) = 1 - S_{Map}(f), \tag{15}$$

where $S_{Map}(f)$ is the saliency map. In the step of smoothing the segmentation result, they are defined as:

$$l_F(f) = H(f), \quad \text{and} \quad l_B(f) = 1 - H(f), \tag{16}$$

where $H(f)$ is the Heaviside function. The second term of Eq. (13) is the smoothness term used to describe the costs/weights of the n-link edges (see Fig. 1(b)), each of which connects two non-terminal nodes, and is defined as:

$$E_{smooth}(f) = \begin{cases} \exp\left(-\dfrac{|I_p - I_q|^2}{2}\right), & \text{if } f(p) \ne f(q), \\ 0, & \text{else}, \end{cases} \tag{17}$$

where $I_p$ and $I_q$ are the values of adjacent pixels $p$ and $q$, respectively.

3.5. Level set formulation and numerical implementation

To compute Eq. (5), we need to introduce the level set function $\phi$ into it using the Heaviside function $H(z)$ and the Dirac function $\delta(z)$. The foreground and background of an image are represented by $\Omega_1$ ($\phi \ge 0$) and $\Omega_2$ ($\phi < 0$), respectively. Thus, Eq. (5) can be rewritten as:

$$E^{SLSM} = (1 - \omega)\sum_{i=1}^{2} \lambda_i \int_{\Omega} |I_{S\_Map}(x) - \varsigma_i|^2 M_i(\phi)\,dx\,dy + \omega \sum_{i=1}^{2} \lambda_i \int_{\Omega} \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x - y)\,|I_k(y) - \psi_k(i)|^2 M_i(\phi)\,dx\,dy + \int_{\Omega} p(|\nabla\phi|)\,dx, \tag{18}$$

where $M_1(\phi) = H(\phi)$ and $M_2(\phi) = 1 - H(\phi)$ are membership functions for the foreground and background. $H(\phi)$ and $\delta(\phi)$ are defined as:

$$H(\phi) = \begin{cases} 1, & \text{if } \phi \ge 0, \\ 0, & \text{if } \phi < 0, \end{cases} \tag{19}$$

and

$$\delta(\phi) = \frac{d}{d\phi}H(\phi). \tag{20}$$
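Returning to the penalty term, which enters Eq. (18) through $p(|\nabla\phi|)$: the difference between the diffusion rates of Eq. (10) and Eq. (12) near $s = 0$ can be checked numerically. The following is a minimal sketch; the function names are illustrative and are not taken from the authors' code.

```python
def diffusion_rate_single_well(s):
    """Diffusion rate of Eq. (10), d_p(s) = 1 - 1/s, for p(s) = (s - 1)^2 / 2."""
    return 1.0 - 1.0 / s


def diffusion_rate_improved(s):
    """Piecewise diffusion rate of Eq. (12) for the improved potential."""
    if s <= 1.0:
        return (s - 1.0) * (2.0 * s - 1.0)  # stays bounded as s -> 0
    return 1.0 - 1.0 / s                    # same branch as Eq. (10) for s >= 1


# Compare the two rates for a few gradient magnitudes s = |grad(phi)|.
for s in (1e-3, 0.5, 1.0, 2.0):
    print(s, diffusion_rate_single_well(s), diffusion_rate_improved(s))
```

The single-well rate becomes a large negative number for small gradient magnitudes, whereas the piecewise rate approaches 1; both vanish at $s = 1$, where the signed distance property holds.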

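We now present the method to minimize Eq. (18) with respect to $\phi$, which is equivalent to finding the steady-state solution of the gradient flow equation [15,16]:

$$\frac{\partial\phi}{\partial t} = -\frac{\partial E^{SLSM}}{\partial\phi}, \tag{21}$$

where $\frac{\partial E^{SLSM}}{\partial\phi} = (1 - \omega)\frac{\partial E^{GS}}{\partial\phi} + \omega\frac{\partial E^{LM}}{\partial\phi} + \frac{\partial E^{P}}{\partial\phi}$ is the weighted sum, following Eq. (5), of the Gâteaux derivatives of the global saliency-guided energy term $E^{GS}$, the local multichannel-based energy term $E^{LM}$ and the penalty term $E^{P}$. To solve them, we smoothly approximate the Heaviside function $H(\phi)$ and the Dirac delta function $\delta(\phi)$ as:

$$H_\varepsilon(\phi) = \frac{1}{2}\left[1 + \frac{2}{\pi}\arctan\left(\frac{\phi}{\varepsilon}\right)\right], \tag{22}$$

and

$$\delta_\varepsilon(\phi) = \frac{1}{\pi}\,\frac{\varepsilon}{\varepsilon^2 + \phi^2}, \tag{23}$$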


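where $\varepsilon$ is a positive constant that controls how closely $H_\varepsilon(\cdot)$ approximates $H(\cdot)$, and is fixed to 1 as in [5,6,9,11]. Next, keeping the other parameters $\varsigma_i$ and $\psi_k(i)$ fixed, we obtain:

$$\frac{\partial E^{GS}(\phi)}{\partial\phi} = \int_{\Omega} \left[\lambda_1 (I_{S\_Map}(x) - \varsigma_1)^2 - \lambda_2 (I_{S\_Map}(x) - \varsigma_2)^2\right]\delta_\varepsilon(\phi)\,dx\,dy, \tag{24}$$

$$\begin{aligned} \frac{\partial E^{LM}(\phi)}{\partial\phi} &= \int_{\Omega} \Big[\lambda_1 \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x - y)\,|I_k(y) - \psi_k(1)|^2 - \lambda_2 \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x - y)\,|I_k(y) - \psi_k(2)|^2\Big]\delta_\varepsilon(\phi)\,dx\,dy \\ &= \int_{\Omega} \Big[\lambda_1 \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x) \star |I_k(x) - \psi_k(1)|^2 - \lambda_2 \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x) \star |I_k(x) - \psi_k(2)|^2\Big]\delta_\varepsilon(\phi)\,dx, \end{aligned} \tag{25}$$

where $\star$ represents the convolution operation. Since $\partial E^{P}(\phi)/\partial\phi$ has already been analyzed in [6], its derivation is omitted. Thus, the Euler-Lagrange equation can be obtained as:

$$\begin{aligned} &(1 - \omega)\left[\lambda_1 (I_{S\_Map}(x) - \varsigma_1)^2 - \lambda_2 (I_{S\_Map}(x) - \varsigma_2)^2\right]\delta_\varepsilon(\phi) \\ &\quad + \omega\Big[\lambda_1 \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x) \star |I_k(x) - \psi_k(1)|^2 - \lambda_2 \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x) \star |I_k(x) - \psi_k(2)|^2\Big]\delta_\varepsilon(\phi) \\ &\quad - \mathrm{div}\left(d_p(|\nabla\phi|)\nabla\phi\right) = 0. \end{aligned} \tag{26}$$

According to the variational principle, the final gradient descent flow is given by:

$$\frac{\partial\phi}{\partial t} = -\delta_\varepsilon(\phi)(\alpha_1 - \alpha_2) + \mathrm{div}\left(d_p(|\nabla\phi|)\nabla\phi\right), \tag{27}$$

with

$$\alpha_i = (1 - \omega)\lambda_i (I_{S\_Map}(x) - \varsigma_i)^2 + \omega\lambda_i \sum_{k \in \{L^*, a^*, b^*\}} K_\sigma(x) \star |I_k(x) - \psi_k(i)|^2, \tag{28}$$

where div is the divergence operator and $d_p(s)$ is the diffusion rate given in Section 3.3. Similarly, minimizing with respect to the variables $\varsigma_i$ and $\psi_k(i)$ while fixing $\phi$, we obtain their update formulas as follows:

$$\varsigma_i = \frac{\int_{\Omega} I_{S\_Map}\,M_i(\phi)\,dx\,dy}{\int_{\Omega} M_i(\phi)\,dx\,dy}, \tag{29}$$

and

$$\psi_k(i) = \frac{K_\sigma \star [M_i(\phi) I_k]}{K_\sigma \star M_i(\phi)}. \tag{30}$$

To implement the above update formulas, the level set $\phi(x, y)$ needs to be discretized to $\phi(x, y, t)$. Specifically, after $n$ iterations, at mesh point $(i, j)$, the level set is denoted as $\phi(ih, jh, n\Delta t)$, where $h$ denotes the grid spacing interval and $\Delta t$ the time step. Thus, Eq. (27) can be rewritten as:

$$\frac{\phi_{i,j}^{n+1} - \phi_{i,j}^{n}}{\Delta t} = -\delta_\varepsilon(\phi_{i,j}^{n})\left[\alpha_1(ih, jh, n\Delta t) - \alpha_2(ih, jh, n\Delta t)\right] + \mathrm{div}\left(d_p(|\nabla_{i,j}\phi_{i,j}^{n}|)\nabla_{i,j}\phi_{i,j}^{n}\right). \tag{31}$$

The steps of the SLSM are shown in Algorithm 1, where $\tau$ and $\varrho$ are constants to terminate the curve evolution.

Algorithm 1 Algorithm steps of the SLSM.
Input: Read in an image to be segmented.
Initialization:
  1. Initialize the level set function $\phi_0$ and set $\phi^1 = \phi_0$;
  2. Initialize the related parameters: $\omega$, $\sigma$, $\varepsilon$, $h$, $\Delta t$;
Repeat:
  1. Compute $H_\varepsilon$ and $\delta_\varepsilon$ according to Eq. (22) and Eq. (23);
  2. Compute $\varsigma_i$, $\psi_k(i)$ according to Eq. (29) and Eq. (30);
  3. Update the level set function according to Eq. (31).
Until: ($|\phi^{i+1} - \phi^{i}| < \tau$) or the iteration number equals $\varrho$
Output: The final segmentation result $\phi = \phi^{i+1}$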
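As a rough illustration of the loop in Algorithm 1, the sketch below evolves a level set on a handful of pixels using only the global saliency-guided force. It is deliberately simplified and is not the authors' implementation: it assumes $\omega = 0$ (no local multichannel term), drops the penalty/diffusion term, and treats the pixels as a flat list so that no spatial operators are needed. The function names and the toy values are illustrative.

```python
import math


def heaviside_eps(phi, eps=1.0):
    """Smoothed Heaviside function of Eq. (22)."""
    return 0.5 * (1.0 + (2.0 / math.pi) * math.atan(phi / eps))


def dirac_eps(phi, eps=1.0):
    """Smoothed Dirac delta of Eq. (23), the derivative of heaviside_eps."""
    return (eps / math.pi) / (eps ** 2 + phi ** 2)


def evolve_global_term(saliency, phi0, lam1=1.0, lam2=1.0,
                       dt=0.1, tau=1e-3, max_iter=200):
    """Iterate the saliency-driven part of the flow on a flat list of pixels."""
    phi = list(phi0)
    for _ in range(max_iter):
        h = [heaviside_eps(p) for p in phi]
        # Eq. (29): membership-weighted mean saliency inside / outside.
        s_in = sum(s * w for s, w in zip(saliency, h)) / sum(h)
        s_out = (sum(s * (1.0 - w) for s, w in zip(saliency, h))
                 / sum(1.0 - w for w in h))
        new_phi = []
        for s, p in zip(saliency, phi):
            a1 = lam1 * (s - s_in) ** 2   # Eq. (28) with omega = 0, i = 1
            a2 = lam2 * (s - s_out) ** 2  # Eq. (28) with omega = 0, i = 2
            # Eq. (31) without the divergence (penalty) term.
            new_phi.append(p - dt * dirac_eps(p) * (a1 - a2))
        change = max(abs(a - b) for a, b in zip(new_phi, phi))
        phi = new_phi
        if change < tau:  # stopping rule of Algorithm 1
            break
    return phi, s_in, s_out
```

Starting from an imperfect initial contour, pixels whose saliency lies above the midpoint of the two region means are pushed toward positive values of the level set function and the others toward negative values, which is the qualitative behavior the global term is meant to provide.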

4. Experiments

In this section, we evaluate the performance of our proposed SLSM. The algorithm is implemented in Matlab R2017b on a PC with an Intel Core i7 3.4 GHz CPU and 24 GB RAM. In addition, we fix the parameters ε = 1, h = 1, Δt = 0.1, τ = 0.001 and ϱ = 200 in all experiments, and the values of the other parameters ω and σ are discussed in Section 4.4.3. The initial level set function φ0 is obtained using two methods: 1) different manual strokes when comparing the proposed method with level set models (see Section 4.3.1) and 2) the results of using graph cuts to segment saliency maps for the other cases. For each compared method, we use the original parameter settings included in the source code. Besides, in all the experiments, we mainly use two saliency detection methods, WC [29] and DHSNet [35], to guide our SLSM. Specifically, we use DHSNet [35] when comparing our method with deep learning based saliency detection methods, and WC [29] for the other cases.

4.1. Datasets

Even though level set models are popular methods for object segmentation, to the best of our knowledge, no specialized datasets are available for evaluating object segmentation using level set based methods. To facilitate future research in this direction, we construct a new dataset containing 2000 images with binary ground truth. They are all collected from public benchmarks for object detection including ASD [24], DUT-OMRON [27], ECSSD [28], HKU-IS [32] and MSRA-B [40]. We select 2000 images from the above 5 datasets rather than use the datasets directly because many of their images are not suitable for evaluating level set methods.

4.2. Evaluation metrics

To quantitatively evaluate the SLSM, we use three metrics commonly used in object segmentation: the Jaccard similarity (JS), the Dice coefficient (DC) and the mean absolute error (MAE). The JS is a statistical metric measuring the similarity between the ground truth and the segmentation result.
The DC measures the overlap between the object pixels in the ground truth $S_G$ and those in the binary segmentation result $S_T$. Their definitions are given below:

$$\mathrm{JS} = \frac{|S_G \cap S_T|}{|S_G \cup S_T|}, \tag{32}$$

and

$$\mathrm{DC} = \frac{2|S_G \cap S_T|}{|S_G| + |S_T|}. \tag{33}$$
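The two overlap metrics of Eqs. (32) and (33) can be computed from binary masks as in the following minimal sketch. The function names and the convention of returning 1.0 when both masks are empty are assumptions made here, not part of the original evaluation code.

```python
import numpy as np

def jaccard(gt, seg):
    # JS = |S_G ∩ S_T| / |S_G ∪ S_T|, Eq. (32).
    gt = np.asarray(gt, dtype=bool)
    seg = np.asarray(seg, dtype=bool)
    union = np.logical_or(gt, seg).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return np.logical_and(gt, seg).sum() / union

def dice(gt, seg):
    # DC = 2 |S_G ∩ S_T| / (|S_G| + |S_T|), Eq. (33).
    gt = np.asarray(gt, dtype=bool)
    seg = np.asarray(seg, dtype=bool)
    total = gt.sum() + seg.sum()
    if total == 0:
        return 1.0  # assumed convention, as above
    return 2.0 * np.logical_and(gt, seg).sum() / total

# Toy example: 3 object pixels in each mask, 2 of them shared.
gt = np.array([[1, 1, 0], [0, 1, 0]])
seg = np.array([[1, 0, 0], [0, 1, 1]])
print(jaccard(gt, seg), dice(gt, seg))  # JS = 2/4, DC = 4/6
```

For a single pair of masks the two scores are related by DC = 2·JS / (1 + JS), so they rank individual segmentations identically; averages over a dataset, as reported in the experiments, need not satisfy this relation exactly.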

Fig. 3. Comparison results with the CV model [9] and the LBF model [11]. The blue boxes show the initialization contours. From top to bottom: original images with corresponding initial contours, results of CV, LBF, SLSM and ground truth (GT), respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Both of the above metrics are based on the overlapping region and give a high score when $S_T$ is close to $S_G$. In contrast, the MAE evaluates the per-pixel difference between the ground truth and the segmentation result, and is defined by:

$$\mathrm{MAE} = \frac{1}{W \cdot H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)|, \tag{34}$$

where S and G are the segmentation result and the ground truth, respectively, and W and H are the width and height of the input image.

4.3. Comparison experiments

We compare our method with 8 state-of-the-art level set models (CV [9], LBF [11], LaGII [13], GARAC [14], SLSA [15], LHIF [16], CLSM [2] and InH_ACM [21]), and 11 saliency detection methods (SF [25], GS [26], MR [27], WC [29], SOB [30], MDF [32], LEGS [33], DCL [34], DHSNet [35], DLS [36] and UCF [37]). Note that the last 6 saliency detection methods are recent deep learning-based methods.

4.3.1. Comparison with level set models

Since the proposed model is defined under the framework of the CV [9] and LBF [11] models, we first compare the proposed method with these two level set models on our dataset. As shown in Fig. 3, by incorporating visual saliency and multichannel color information into the two models, our method achieves a significant improvement over both, especially for objects with complex backgrounds. Fig. 4 shows a visual comparison of our method with 6 state-of-the-art level set models (LaGII [13], GARAC [14], SLSA [15], LHIF [16], CLSM [2] and InH_ACM [21]) on our dataset. From the segmentation results, we can see that the 6 level set models mostly work for segmenting objects with homogeneous foreground and background, but usually produce inaccurate segmentations with isolated outliers because their energy functions are constructed using grayscale information only. In contrast, our method, which uses the multichannel information of the color image to define the energy function, can accurately extract the color objects. To quantitatively compare our method with the 6 state-of-the-art level set models, Table 1 reports the average values of the Dice coefficient (DC) and Jaccard similarity (JS) for segmenting the 2000 images in our dataset. Our method achieves the best segmentation accuracy among the compared methods. Fig. 5 shows the corresponding CPU running times and iteration counts of the above methods when segmenting the six images of Fig. 4; our method is the fastest. This is because our global saliency-guided energy term guides the evolving curve to converge rapidly around the objects, which directly improves the segmentation speed.

4.3.2. Comparison with saliency detection methods

We first compare our method with 5 classical saliency detection methods: SF [25], GS [26], MR [27], WC [29] and SOB [30]. The saliency maps and their corresponding Precision-Recall (PR) curves are shown in Figs. 6 and 7, respectively, where the PR curves are obtained by computing the precision and recall values:

$$\mathrm{Precision} = \frac{|S_G \cap S_T|}{|S_T|}, \quad \mathrm{Recall} = \frac{|S_G \cap S_T|}{|S_G|}, \tag{35}$$

with the range of saliency values rescaled from [0, 1] to [0, 255]. Specifically, each threshold yields a pair of precision/recall scores, and we combine all the pairs to form the PR curve used to evaluate model performance. Note that, unlike the saliency detection methods, our method directly produces a single binary segmentation result, so we obtain only one precision value and one recall value, as shown in Fig. 7. The closer the PR curve is to the top-right corner, the better the result. We observe that our method performs favorably compared with the 5 classical saliency detection methods. This is because incorporating visual saliency into our level set formulation effectively helps to discriminate pixels along object boundaries, whereas the saliency detection methods produce fuzzy results. Since our segmentation results are binary, for a fair comparison we use graph cuts to segment the saliency maps and obtain the corresponding binary results, shown in Fig. 8. We see that our method consistently outperforms the 5 saliency detection methods.

Fig. 4. Comparison results with state-of-the-art level set models. From top to bottom: original images with corresponding initial contours, results of LaGII [13], GARAC [14], SLSA [15], LHIF [16], CLSM [2], InH_ACM [21], SLSM and ground truth (GT), respectively.

Table 1. The values of DC and JS. The best result for each metric is achieved by SLSM.

Metrics   LaGII    GARAC    SLSA     LHIF     CLSM     InH_ACM  SLSM
DC        0.9030   0.9168   0.8260   0.8590   0.9487   0.8638   0.9838
JS        0.8363   0.8520   0.7139   0.7667   0.9083   0.7964   0.9682
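The threshold sweep behind the PR curves of Eq. (35), together with the MAE of Eq. (34), can be sketched as follows. This is a minimal illustration under stated assumptions: the saliency map is assumed to lie in [0, 1], precision is set to 1.0 when no pixel passes a threshold, and the function names are hypothetical.

```python
import numpy as np

def pr_curve(saliency, gt):
    # Rescale saliency from [0, 1] to [0, 255] and threshold at 0..255.
    sal = np.round(np.asarray(saliency, dtype=float) * 255.0)
    gt = np.asarray(gt, dtype=bool)
    n_gt = gt.sum()
    precision, recall = [], []
    for t in range(256):
        seg = sal >= t                        # binary result S_T at threshold t
        inter = np.logical_and(seg, gt).sum()  # |S_G ∩ S_T|
        n_seg = seg.sum()
        # Assumed convention: empty denominators give a score of 1.0.
        precision.append(inter / n_seg if n_seg else 1.0)
        recall.append(inter / n_gt if n_gt else 1.0)
    return np.array(precision), np.array(recall)

def mae(result, gt):
    # Mean absolute per-pixel difference, Eq. (34).
    result = np.asarray(result, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(result - gt)))
```

A binary segmentation, such as the output of the SLSM, gives the same mask at every threshold above zero, which is why it appears as a single point rather than a curve in the PR plot.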

In addition, the precision and recall values in Fig. 9 show that our method achieves the highest accuracy among the compared methods. We then quantitatively compare our SLSM with 6 recent deep learning-based saliency detection methods: MDF [32], LEGS [33], DCL [34], DHSNet [35], DLS [36] and UCF [37]. In this experiment, we use the saliency maps produced by the WC method [29] and DHSNet [35] to guide our level set model. The WC method is not learning-based, whereas DHSNet is a deep learning-based method. As shown in Table 2, our approach guided by DHSNet achieves the smallest errors on all three datasets. In addition, our approach guided by the WC method outperforms two of the five deep learning-based methods with reported results on the OMRON dataset and three of the six on the HKU-IS dataset. This means that, even when built on the non-learning-based WC, our level set formulation can outperform some deep learning-based methods.

Fig. 5. The corresponding values of CPU running time and iteration times of LaGII [13], GARAC [14], SLSA [15], LHIF [16], CLSM [2], InH_ACM [21] and SLSM segmenting the six images of Fig. 4.

Fig. 6. Comparison results with saliency detection methods. From top to bottom: original images, saliency maps of SF [25], GS [26], MR [27], WC [29], SOB [30], binary results of SLSM and ground truth (GT), respectively.

Fig. 7. Precision-Recall (PR) curves of our method and other classical saliency detection methods on 5 public benchmark datasets (the closer the PR curve is to the top-right corner, the better the result). The results of our method are shown as single points because our segmentation results are binary.

Table 2. Comparisons with six deep learning-based saliency detection methods: MDF [32], LEGS [33], DCL [34], DHSNet [35], DLS [36] and UCF [37] using the mean absolute error (MAE) (defined in Eq. (34)). The results of our approach guided by the WC [29] method and the DHSNet method are shown in the last two columns. The rankings are shown in brackets. The symbol "–" in the table denotes that the dataset is not tested in the corresponding paper and thus the result is not available.

Methods   MDF        LEGS       DCL        DHSNet     DLS        UCF        Ours(WC)   Ours(DHSNet)
ECSSD     0.1080(5)  0.1180(6)  0.1495(8)  0.0715(3)  0.0900(4)  0.0689(2)  0.1312(7)  0.0613(1)
OMRON     –          0.1334(6)  0.1573(7)  0.1164(3)  0.0930(2)  0.1203(4)  0.1265(5)  0.0917(1)
HKU-IS    0.1290(7)  0.1193(6)  0.1359(8)  0.0763(4)  0.0720(3)  0.0620(2)  0.1152(5)  0.0608(1)

4.4. Analysis of the SLSM

4.4.1. Robustness analysis

1) Robustness analysis for different initializations: To evaluate the robustness of our method to different saliency maps for initialization, we test our approach using the saliency maps produced by four conventional saliency detection methods, GS [26], MR [27], WC [29] and SOB [30], on our dataset. The visual and quantitative segmentation results are shown in Figs. 10 and 11, respectively. As Fig. 10 shows, similar visual results are obtained using different saliency maps. In addition, as Fig. 11 shows, the values of JS and DC for segmenting the 2000 images of our dataset vary only within a small range across the four saliency detection methods. In other words, the proposed SLSM is robust to different saliency-guided initializations. Moreover, comparing the results of our method with those of the four conventional saliency detection methods alone (Fig. 11), we observe that our method improves the segmentation accuracy in each case. This suggests that our method can work with different saliency detection methods. To further evaluate the robustness of our method to different initializations, we test it using different initializations produced by manual strokes, as shown in Fig. 12. Similar results are obtained from manual initializations with different shapes, positions and sizes. In other words, our SLSM is robust to different initial contours.

2) Robustness analysis for different color spaces: Here, we test the effectiveness of the proposed SLSM in different color spaces, namely RGB, YCbCr, HSV and NTSC, as shown in Fig. 13. Similar segmentation results are obtained in the different color spaces, which suggests that the proposed SLSM works with several commonly used color spaces.

4.4.2. Effectiveness analysis

1) Effectiveness analysis for the energy terms: Here, we evaluate the effectiveness of the local energy term and the global energy term by removing each of them from our energy function. On the one hand, as shown in the third column of Fig. 14, using the global term only (ω = 0) fails to propagate the initial contours to the object boundaries. On the other hand, using the local term only (ω = 1) cannot capture the object regions that are far from the initial contours; see the third and the fourth examples in Fig. 14. In contrast, using both terms (ω = 0.5), the proposed SLSM successfully extracts the object contours, which demonstrates the effectiveness of each term of the SLSM.
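The role of ω in this ablation can be illustrated with a small sketch. It assumes that the two terms are blended as a convex combination, (1 − ω) times the global term plus ω times the local term. This is consistent with the settings ω = 0 (global term only) and ω = 1 (local term only) described above, but the exact form is the one given by the energy function of the model, and the function name here is hypothetical.

```python
import numpy as np

def combine_terms(global_term, local_term, omega=0.5):
    # Assumed convex combination of the global and local contributions.
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    g = np.asarray(global_term, dtype=float)
    l = np.asarray(local_term, dtype=float)
    return (1.0 - omega) * g + omega * l

# omega = 0 keeps only the global term, omega = 1 only the local term,
# and omega = 0.5 weighs both equally.
g = np.array([2.0, 4.0])
l = np.array([0.0, 8.0])
print(combine_terms(g, l, 0.0), combine_terms(g, l, 1.0), combine_terms(g, l, 0.5))
```

Under this reading, sweeping ω between 0 and 1 moves the model continuously between the two ablated variants.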


Fig. 8. Comparison results with saliency detection methods. From top to bottom, original images, binary results of SF [25], GS [26], MR [27], WC [29], SOB [30], SLSM and ground truth, respectively.

2) Effectiveness analysis for the graph cuts: To evaluate the effectiveness of graph cuts for smoothing segmentation results, we conduct a comparison experiment with the traditional length regularization-based smoothness term. As shown in the first row of Fig. 15, the results obtained without any smoothness term contain many isolated regions that seriously degrade the accuracy. To address this issue, traditional level set models use a length regularization term in their energy functions. In contrast, our level set formulation with graph cuts achieves the best performance.

4.4.3. Parameter analysis

In this subsection, we analyze and discuss in detail how the parameters used in this paper influence the quality of the segmentation results. In our experiments, two parameters play the main role in the segmentation process: the weight ω between the global and local energy terms, and the standard deviation σ of the truncated Gaussian function, which controls the size of the local region. The other parameters, such as ε, h and Δt, are set following the suggestions in previous papers [6,11]. Thus, we mainly analyze the influence of ω and σ on segmentation.

1) Influence of ω: ω weighs the relative importance of the global energy term and the local energy term of our SLSM, as shown in Fig. 14. When ω takes a smaller value, for example 0, the global term plays the main role in guiding curve evolution. In contrast, when ω takes a larger value, for example 1, the local term plays the main role. In our experiments, we set ω = 0.5 so that the global and local energy terms play equal roles in guiding curve evolution. Fig. 16(a) shows the variation of JS as ω increases from 0 to 1 when the proposed method segments the four images of Fig. 14. The highest JS value is reached at ω = 0.5, which supports this setting.

Fig. 9. Illustration of precision and recall values of our method and other classical saliency detection methods on 5 public benchmark datasets.

Fig. 10. Segmentation results of our method SLSM under different initializations. In each subfigure, the first column shows the original image (top) and the ground truth (GT) (bottom). The remaining four columns show the initial contours (top) produced from the saliency detection methods of GS [26], MR [27], WC [29], SOB [30], and the corresponding segmentation results (bottom).

Fig. 11. Illustration of the values of DC and JS of the proposed method using different saliency maps produced by GS [26], MR [27], WC [29] and SOB [30]. The red lines denote the results using our method, and the blue lines denote results using conventional methods. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 12. Robustness of our method SLSM to different manual stroke initializations. In each subfigure, the first column shows the original image (top) and the ground truth (GT) (bottom). The remaining four columns show the different initial contours (top) produced by manual stroke and the corresponding segmentation results (bottom).

Fig. 13. Tests on different color spaces. From left to right: original images, segmentation results of RGB, YCbCr, HSV, NTSC color space, respectively.

Fig. 14. Illustration of the effectiveness of each energy term of our model. From left to right: original images, initial contours, results of using global saliency-guided term only (ω = 0), using local multichannel-based term only (ω = 1), using the two terms simultaneously (ω = 0.5) and ground truth, respectively.

Fig. 15. Illustration of the effectiveness of graph cuts. From top to bottom: segmentation results of the SLSM without using the smoothness term, using the traditional length smoothness term, using graph cuts and the ground truth, respectively.

Fig. 16. Illustration of the values of JS and DC of the proposed method using different ω (corresponding to Fig. 14) and σ (corresponding to Fig. 17).

Fig. 17. Segmentation results of SLSM using different σ values. From left to right: original images with corresponding initial contours, results of σ = 1, σ = 5, σ = 10, respectively.

2) Influence of σ: Fig. 17 shows the segmentation results of the SLSM using different σ values. We can see that when σ = 1, the segmentation results are influenced by the initialization (see the second column of Fig. 17), which shows that a small value of σ easily leads to overfitting and decreases the robustness of the proposed method. In contrast, when σ = 10, the SLSM cannot accurately segment object regions whose colors are similar to the background, which shows that a large value of σ leads to underfitting and prevents the method from accurately extracting the local information of color images (see the fourth column of Fig. 17). Based on the above discussion, we set σ = 5 for all the experiments in this paper because, with this value, the proposed SLSM gives good performance (see the third column of Fig. 17). To further validate this setting, Fig. 16(b) shows the variation of DC as σ increases from 1 to 10 when the proposed method segments the image of Fig. 17 using 3 different initializations. The highest DC value is reached at σ = 5, which supports this setting.

5. Conclusion and future work

This paper proposes a new saliency-guided level set model (SLSM) for automatic color object segmentation. The SLSM consists of two main energy terms: the global saliency-guided (GS) energy term and the local multichannel-based (LM) energy term. The GS energy term guides the evolving curve to converge rapidly around the object and improves the segmentation speed and robustness of the SLSM. Based on the information in different color channels, the LM energy term accurately extracts the boundaries of objects and improves the segmentation accuracy of the SLSM. In addition, the combination of the Heaviside function and the data term in our graph cuts formulation significantly improves the segmentation accuracy and avoids the small isolated regions that often appear in active contour based segmentation results. Moreover, to avoid the tedious manual initialization process, the proposed SLSM is initialized using a salient region segmentation. Extensive experiments on public benchmark datasets and our collected dataset demonstrate that our method outperforms state-of-the-art level set models and classical saliency detection methods and, when guided by DHSNet, achieves a lower MAE than recent deep learning-based saliency detection methods.

The proposed method is still far from perfect. Fig. 18 shows failure cases of our SLSM. The SLSM performs poorly when applied to images with severe color inhomogeneity. For example, when either the foreground or the background is severely inhomogeneous (see Fig. 18(a) and (b)), or both of them are (see Fig. 18(c) and (d)), the segmentation results are inaccurate. In future work, we plan to address this weakness by defining a color bias field estimation term that can accommodate the color inhomogeneity of color objects.

Fig. 18. Illustration of failure cases of SLSM (top row: segmentation results, bottom row: ground truth).

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61502396, No. 61501286), the Fundamental Research Funds for the Central Universities (No. GK201702015), the National Key Research and Development Program of China (No.
2016YFB1001004), the Natural Sciences and Engineering Research Council of Canada and the University of Alberta.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.patcog.2019.04.019.

References

[1] S. Bai, S. Sun, X. Bai, Z. Zhang, Q. Tian, Improving context-sensitive similarity via smooth neighborhood for object retrieval, Pattern Recognit. 83 (2018) 353–364.
[2] S. Zhou, J. Wang, M. Zhang, Q. Cai, Y. Gong, Correntropy-based level set method for medical image segmentation and bias correction, Neurocomputing 234 (2017) 216–229.
[3] M. Gong, Y. Qian, L. Cheng, Integrated foreground segmentation and boundary matting for live videos, IEEE Trans. Image Process. 24 (4) (2015) 1356–1370.
[4] S. Osher, J.A. Sethian, Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations, J. Comput. Phys. 79 (1) (1988) 12–49.
[5] C. Li, C. Xu, C. Gui, M.D. Fox, Level set evolution without re-initialization: a new variational formulation, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 430–436.
[6] C. Li, C. Xu, C. Gui, M.D. Fox, Distance regularized level set evolution and its application to image segmentation, IEEE Trans. Image Process. 19 (12) (2010) 3243–3254.
[7] X.-H. Zhi, H.-B. Shen, Saliency driven region-edge-based top down level set evolution reveals the asynchronous focus in image segmentation, Pattern Recognit. 80 (2018) 241–255.
[8] M. Gobbino, Finite difference approximation of the Mumford-Shah functional, Commun. Pure Appl. Math. 51 (2) (1998) 197–228.
[9] T.F. Chan, L.A. Vese, Active contours without edges, IEEE Trans. Image Process. 10 (2) (2001) 266–277.
[10] Q. Cai, H. Liu, S. Zhou, J. Sun, J. Li, An adaptive-scale active contour model for inhomogeneous image segmentation and bias field estimation, Pattern Recognit. 82 (2018) 79–93.
[11] C. Li, C.-Y. Kao, J.C. Gore, Z. Ding, Implicit active contours driven by local binary fitting energy, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7.
[12] X.-F. Wang, D.-S. Huang, H. Xu, An efficient local Chan–Vese model for image segmentation, Pattern Recognit. 43 (3) (2010) 603–618.
[13] S. Zhou, J. Wang, S. Zhang, Y. Liang, Y. Gong, Active contour model based on local and global intensity information for medical image segmentation, Neurocomputing 186 (2016) 107–118.
[14] H. Wang, T.-Z. Huang, Z. Xu, Y. Wang, A two-stage image segmentation via global and local region active contours, Neurocomputing 205 (2016) 130–140.
[15] K. Zhang, L. Zhang, K.-M. Lam, D. Zhang, A level set approach to image segmentation with intensity inhomogeneity, IEEE Trans. Cybern. 46 (2) (2016) 546–557.
[16] L. Wang, Y. Chang, H. Wang, Z. Wu, J. Pu, X. Yang, An active contour model based on local fitted images for image segmentation, Inf. Sci. 418 (2017) 61–73.
[17] L. Wang, J. Zhu, M. Sheng, A. Cribb, S. Zhu, J. Pu, Simultaneous segmentation and bias field estimation using local fitted images, Pattern Recognit. 74 (2018) 145–155.
[18] A. Rodtook, K. Kirimasthong, W. Lohitvisate, S.S. Makhanov, Automatic initialization of active contours and level set method in ultrasound images of breast abnormalities, Pattern Recognit. 79 (2018) 172–182.
[19] Q. Cai, H. Liu, Y. Qian, J. Li, X. Duan, Y.-H. Yang, Local and global active contour model for image segmentation with intensity inhomogeneity, IEEE Access 6 (2018) 54224–54240.
[20] B. Han, Y. Wu, Active contours driven by global and local weighted SPF for image segmentation, Pattern Recognit. 88 (2019) 715–728.
[21] L. Dai, J. Ding, J. Yang, Inhomogeneity-embedded active contour for natural image segmentation, Pattern Recognit. 48 (8) (2015) 2513–2529.
[22] J. Ding, J. Shen, H. Pang, S. Chen, J. Yang, Exploiting intensity inhomogeneity to extract textured objects from natural scenes, in: Asian Conference on Computer Vision, 2009, pp. 1–10.
[23] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[24] R. Achanta, S. Hemami, F. Estrada, S. Süsstrunk, Frequency-tuned salient region detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597–1604.
[25] F. Perazzi, P. Krähenbühl, Y. Pritch, A. Hornung, Saliency filters: contrast based filtering for salient region detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 733–740.
[26] Y. Wei, F. Wen, W. Zhu, J. Sun, Geodesic saliency using background priors, in: European Conference on Computer Vision, 2012, pp. 29–42.
[27] C. Yang, L. Zhang, H. Lu, X. Ruan, M.-H. Yang, Saliency detection via graph-based manifold ranking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[28] Q. Yan, L. Xu, J. Shi, J. Jia, Hierarchical saliency detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155–1162.
[29] W. Zhu, S. Liang, Y. Wei, J. Sun, Saliency optimization from robust background detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
[30] R.S. Srivatsa, R.V. Babu, Salient object detection via objectness measure, in: IEEE International Conference on Image Processing, 2015, pp. 4481–4485.
[31] M.-M. Cheng, Z. Zhang, W.-Y. Lin, P. Torr, BING: Binarized normed gradients for objectness estimation at 300fps, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
[32] G. Li, Y. Yu, Visual saliency based on multiscale deep features, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
[33] L. Wang, H. Lu, X. Ruan, M.-H. Yang, Deep networks for saliency detection via local estimation and global search, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3183–3192.
[34] G. Li, Y. Yu, Deep contrast learning for salient object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478–487.
[35] N. Liu, J. Han, DHSNet: Deep hierarchical saliency network for salient object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 678–686.
[36] P. Hu, B. Shuai, J. Liu, G. Wang, Deep level sets for salient object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[37] P. Zhang, D. Wang, H. Lu, H. Wang, B. Yin, Learning uncertain convolutional features for accurate saliency detection, in: IEEE International Conference on Computer Vision, 2017.
[38] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1222–1239.
[39] S. Yin, Y. Qian, M. Gong, Unsupervised hierarchical image segmentation through fuzzy entropy maximization, Pattern Recognit. 68 (2017) 245–259.
[40] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, H.-Y. Shum, Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 353–367.

Qing Cai received the M.E. degree in the Department of Automation from Northwestern Polytechnical University, Xi’an, China, in 2016. He is currently a PhD student in the Department of Automation, Northwestern Polytechnical University. His main research interests include image segmentation, target tracking, deep learning and machine learning, etc.

Huiying Liu received her B.E., M.E. and PhD. degrees from Northwestern Polytechnical University, Xi’an, China, in 1981, 2000 and 2007, respectively. She was a senior visiting scholar at Clyde University from 2003 to 2004. She is deputy director of Department of systems engineering at Shaanxi Institute of automation now. Since October 1996, she has been a Professor in the Department of Automation, Northwestern Polytechnical University. Her research interests cover a wide range of topics from computer science to computer vision, which include system modeling and simulation, image processing and image recognition, pedestrian detection, single target tracking, multi-target tracking. She has published over 60 papers in international journals and conference. She has completed more than 20 important research projects and funds, including National Natural Science Foundation, “863” project, the General Armament Department Project and aviation science fund, etc. She has won a third prize of scientific and technological progress of the National Defense Science and Technology Commission, a second prize of Shaanxi Province teaching achievement, as well as a Northwestern Polytechnic University Teaching Achievement Award and other awards more than 20 items.

Yiming Qian received the B.E. degree from University of Science and Technology of China, Hefei, China, in 2012, and the M.Sc. degree from Memorial University of Newfoundland, St. John’s, Canada, in 2014. He is currently pursuing the Ph.D. degree at the Department of Computing Science, University of Alberta, Edmonton, Canada. His research interests include machine learning and computer vision.

Sanping Zhou received the M.E. degree from Northwestern Polytechnical University, Xi’an, China, in 2015. He is currently pursuing the Ph.D. degree in Institute of Artificial Intelligence and Robotics at Xi’an Jiaotong University. His research interests include machine learning, deep learning and computer vision, with a focus on medical image segmentation, person re-identification, image retrieval, image classification and visual tracking. He has published about 15 conference and journal papers.

Xiaojun Duan is professor of Northwestern Polytechnical University AIEEE member and AIAA member. He graduated from Northwestern Polytechnical University in 2002, and received a master’s degree in navigation, guidance and control in 2004, and a doctorate in control science and Engineering in 2010. Research directions include navigation guidance and control, flight simulation and testing, redundancy management and airborne embedded software. He engaged in the unmanned aerial vehicle (UAV) industry for 15 years, mainly engaged in flight control, UAV vision system, navigation guidance, avionics, simulation testing technology research.

Yee-Hong Yang received his BSc (first honors) from the University of Hong Kong, his MSc from Simon Fraser University, and his M.S.E.E. and PhD from the University of Pittsburgh. He was a faculty member in the Department of Computer Science at the University of Saskatchewan from 1983 to 2001 and served as Graduate Chair from 1999 to 2001. While there, in addition to department level committees, he also served on many college and university level committees. Since July 2001, he has been a Professor in the Department of Computing Science at the University of Alberta. He served as Associate Chair (Graduate Studies) in the same department from 2003 to 2005. His research interests cover a wide range of topics from computer graphics to computer vision, which include physically based animation of Newtonian and non-Newtonian fluids, texture analysis and synthesis, human body motion analysis and synthesis, computational photography, stereo and multiple view computer vision, and underwater imaging. He has published over 100 papers in international journals and conference proceedings in the areas of computer vision and graphics. He is a Senior Member of the IEEE and serves on the Editorial Board of the journal Pattern Recognition. In addition to serving as a reviewer to numerous international journals, conferences, and funding agencies, he has served on the program committees of many national and international conferences. In 2007, he was invited to serve on the expert review panel to evaluate computer science research in Finland.