Engineering Applications of Artificial Intelligence 77 (2019) 159–176
Roadside vegetation segmentation with Adaptive Texton Clustering Model
Ligang Zhang ∗, Brijesh Verma
School of Engineering and Technology, Central Queensland University, Brisbane, Australia
ARTICLE INFO
Keywords: Vegetation segmentation; Feature extraction; Supervised learning; K-means clustering; Object recognition
ABSTRACT
Automatic roadside vegetation segmentation is important for various real-world applications, and one main challenge is to design algorithms that are capable of representing discriminative characteristics of vegetation while maintaining robustness against environmental effects. This paper presents an Adaptive Texton Clustering Model (ATCM) that combines pixel-level supervised prediction and cluster-level unsupervised texton occurrence frequencies into superpixel-level majority voting for adaptive roadside vegetation segmentation. The ATCM learns generic characteristics of vegetation from training data using class-specific neural networks with color and texture features, and adaptively incorporates local properties of vegetation in every test image using texton-based adaptive K-means clustering. The adaptive clustering groups test pixels into local clusters, accumulates texton frequencies in every cluster and calculates cluster-level class probabilities. The pixel- and cluster-level probabilities are integrated via superpixel-level voting to determine the category of every superpixel. We evaluate the ATCM on three real-world datasets, including the Queensland Department of Transport and Main Roads, the Croatia, and the Stanford background datasets, showing very competitive performance compared with state-of-the-art approaches.
∗ Corresponding author.
E-mail addresses: [email protected] (L. Zhang), [email protected] (B. Verma).
https://doi.org/10.1016/j.engappai.2018.10.009
Received 5 May 2018; Received in revised form 2 September 2018; Accepted 9 October 2018; Available online 30 October 2018. 0952-1976/© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Automatic segmentation of roadside vegetation such as trees and grasses from roadside scene data is a pre-requisite for performing detailed analysis of the conditions of the vegetation, such as growth stage, height, brownness and size. Access to these conditions plays a significant role in both agriculture and forestry, for instance in monitoring vegetation growth conditions, making effective plans for disease treatment, and developing effective roadside management strategies. However, it is still a challenging task to develop automatic systems for robust vegetation segmentation in unconstrained natural conditions, primarily because of the frequent presence of substantial variations in the appearance and structure of roadside vegetation, environmental conditions, and data capturing settings. The visual configuration of roadside vegetation is often characterized by a highly unstructured, dynamic, and unpredictable appearance that can be represented by intensity, color, texture, shape, geometry, location, etc. The scene data is often accompanied by various types of real-world environmental effects, such as under- or over-exposure, shadows of objects and low illumination. These effects may change substantially and can be impacted by various factors including geographic location, time of day, season, climate, etc. The data capturing settings are also likely to significantly influence the quality of the captured data, leading to data with blurred content, low resolution, varied camera viewpoints, etc. Thus, a central task of roadside vegetation segmentation is to design robust systems that can effectively represent the main generic characteristics of vegetation, while maintaining high robustness against environmental effects and scene content in new unseen data.

Vegetation analysis has been given increasing attention in many fields, including remote sensing, forestry, agriculture, and ecosystem studies (Ponti, 2013). Most existing studies (Lu et al., 2016; Schaefer and Lamb, 2016; Chang et al., 2017) exploited remote sensing solutions based on satellite and aerial data collected using various optical imaging sensors mounted on space-borne, airborne, or terrestrial platforms. However, only limited attention has been given specifically to vegetation analysis using ground-based roadside data, which can be collected using ordinary optical equipment such as vehicle-mounted cameras and mobile phone cameras. Compared with remotely sensed data, ground-based data offers several appealing advantages, such as being easy to operate, requiring less expense, and supporting location-specific analysis. A close-distance precise analysis of roadside vegetation is important for obtaining various parameters of vegetation in practical applications such as the assessment of fire hazards and the diagnosis of diseases associated with vegetation.

Existing solutions to roadside vegetation segmentation from ground-based data can be roughly classified into three groups, according to the type of feature used: (1) visible feature based approaches, which discriminate vegetation from other objects based on their visual characteristics extracted from the visible spectrum such as structure, color, and texture. Once the visual features are extracted, a supervised prediction
algorithm is then trained to classify new input data into different object categories. A separate training dataset is often required to train the algorithm. However, one main drawback of this approach is that its accuracy is heavily dependent on the representativeness of the extracted visible features and the generality of the learnt prediction algorithms. Since the training and test data are likely to have substantial variations, the algorithm has difficulty capturing local properties of vegetation in new test data. (2) Invisible feature based approaches, which perform vegetation segmentation primarily based on the spectral properties of vegetation and other objects in the invisible spectrum. Probably the most well-known invisible features are Vegetation Indices (VIs), which are designed to capture the different generic reflectance characteristics of vegetation and other objects. However, similar to supervised prediction, VIs cannot incorporate local properties in new test data either. (3) Hybrid approaches, which combine both visible and invisible features for more robust vegetation segmentation. Although improved performance has been obtained by utilizing both types of features, hybrid approaches also inherit the drawbacks of both. It can be seen that existing solutions have limited capacity for representing a large number of different types of vegetation in new test data. One effective solution to this issue is to develop algorithms that can automatically adapt to local properties of objects in every test image.
In this paper, we present an Adaptive Texton Clustering Model (ATCM) that incorporates both generic and local properties of vegetation for robust vegetation segmentation from natural roadside images. Specifically, the ATCM integrates pixel-level supervised prediction and cluster-level unsupervised texton occurrence frequencies into a superpixel-based voting strategy to derive classification decisions. For supervised prediction, pixel-level class probabilities of a test image are predicted using class-specific Artificial Neural Networks (ANNs), which represent class label likelihoods of all pixels based on generic properties of all objects learnt from the training data. To capture local properties of objects, including vegetation, in a test image, we present texton-based K-means clustering, which classifies pixels in the test image into a group of local clusters and obtains the occurrence frequencies of textons over all pixels within every cluster. The cluster-level texton occurrence frequencies are then converted into class probabilities and further combined with pixel-level class probabilities using a linear weighting method. The final segmentation of vegetation from other objects is completed using a majority voting strategy over a set of oversegmented superpixels. The ATCM is novel in the sense that it utilizes pixel-level supervised prediction to reflect the generic features learnt from the training data, and cluster-level texton occurrence frequencies to represent local properties in the test data. As a result, information learnt from both the training data and the test image is fully considered to support the segmentation decision. We demonstrate promising performance of the ATCM for vegetation segmentation on two real-world roadside image datasets and for object segmentation on the Stanford background dataset.

The remainder of this paper is structured as follows. Section 2 reviews related work on vegetation segmentation and highlights our main contributions. Section 3 describes the proposed ATCM. The experimental results and discussions are given in Sections 4 and 5, respectively. Finally, Section 6 draws some conclusions.

2. Related work and contributions of this paper

This section briefly describes existing related efforts on vegetation segmentation using ground-based roadside data. We categorize these efforts into three approaches: visible approaches, invisible approaches, and hybrid approaches. The main contributions of this paper are highlighted at the end of this section.

2.1. Visible vegetation segmentation approach

Visible approaches aim to segment vegetation from other objects, such as sky, road, river, soil, traffic sign and car, based on visual features of vegetation in the visible spectrum. These features can be represented by different aspects such as color, texture, shape, structure and geometry. One merit of using visible features lies in the fact that they are highly consistent with human eye perception of real-world objects. They are also close to human understanding and analysis of the scene content in real-life scenarios. However, whether designing new visible features or choosing from existing visible features, there is no guarantee that those features are capable of effectively representing the most discriminative characteristics of vegetation and robustly handling different types of environmental effects such as variations in lighting conditions and shadows of objects.

As a dominant resource that human eyes rely on in the recognition of real-life objects, color is one of the most frequently used visual features in current work on vegetation segmentation. In reality, most types of vegetation can be primarily represented by a green or a yellow color. This makes vegetation easily distinguishable from other objects such as road and sky using color information. However, finding a robust representation of vegetation color in unconstrained real-world conditions is still a difficult task. As an example, vegetation is theoretically validated to have a green color in the HSV space in most environments; however, this may become untrue in scene data with sky or varied lighting conditions, such as under-exposure and shining effects. Many studies have investigated the suitability of different color spaces, such as CIELab (Blas et al., 2008), YUV (Zafarifar and de With, 2008), HSV (Harbas and Subasic, 2014c) and RGB (Harbas and Subasic, 2014b), for representing various objects. Unfortunately, no single space has been found to give consistently superior segmentation results in various natural conditions. In fact, exploiting semantically dominant color features that can adapt to local properties of the test data has received relatively little attention (Junqing et al., 2005).

Texture is another major type of visible feature, which also contains crucial information for vegetation segmentation, particularly for vegetation with similar visual characteristics and in natural conditions where various environmental challenges are present. Texture representations can be obtained in various ways, such as applying wavelet filters (e.g., Gabor filters (Nguyen et al., 2012a) and the Continuous Wavelet Transform (CWT) (Harbas and Subasic, 2014c)), extracting pixel intensity distributions (e.g., Pixel Intensity Differences (PIDs) (Blas et al., 2008; Zafarifar and de With, 2008)) and calculating pixel intensity variations in spatial neighborhoods (Nguyen et al., 2011; Schepelmann et al., 2010). Approaches also adopt statistical features to represent texture, such as entropy (Harbas and Subasic, 2014b), spatial statistics (Schepelmann et al., 2010), and superpixel-level statistical features (Balali and Golparvar-Fard, 2015). Table 1 shows a list of typical visible approaches presented in current studies for vegetation segmentation. These approaches are primarily based on a combination of multiple types of visual features and a supervised learning process, which is adopted to train prediction algorithms using visible features extracted from the training data.

To enhance TV scene quality, Zafarifar and de With (2008) combined color, texture and geometric features for segmenting grass regions, including a 3D YUV-based Gaussian model, PIDs between adjacent pixels, and a position model. Blas et al. (2008) also adopted PIDs as the texture features, which were further fused with the L, a and b color channels to create texton histogram features for discriminating road and vegetation. Four types of statistical measures were used in Schepelmann et al. (2010) for discriminating between illuminated grass and artificial obstacles to find the location of drivable terrain.
Table 1
A list of existing visible approaches to vegetation segmentation.

Ref. | Color | Texture | Predictor | Object No. | Database | Data size | Acc. (%)
Campbell et al. (1997) | RGB, O1, O2, R-G, (R+G)/2-B | Gabor filter, shape | SOM + MLP | 11 | Bristol database | 3,751 I | 61.1 / 80
Harbas and Subasic (2014b) | RGB | Entropy | SVM + MO | 2 | Croatia dataset | 270 I | 95.0
Harbas and Subasic (2014c) | RGB, HSV, YUV, Lab | 2D CWT | SVM + MO | 2 | Croatia dataset | 270 I | 96.1
Zafarifar and de With (2008) | YUV | PID | Soft segmentation | 2 | Self-created | 62 I | 91
Nguyen et al. (2012a) | O1, O2 | NDVI & MNDVI, Gabor filter | Spreading rule | 2 | Self-created | 2,000 I, 10 V | 95
Liu et al. (2007) | H, S | Ladar | RBF SVM | 2 | Self-created | — | —
Blas et al. (2008) | Lab | PID | K-means | — | Brodatz dataset; Self-created | — | 79
Schepelmann et al. (2010) | Gray | Intensity, edge, neighborhood centroid | Clustering | 2 | Self-created | 40 I | 95 / 90
Chowdhury et al. (2015) | Gray | LBP, GLCM | SVM, ANN, KNN | 2 | Self-created | 110 I | 92.7
Bosch et al. (2007) | RGB, HLS, Lab | Co-occurrence matrix | Gaussian PDF + global energy | 5 / 5 / 7 | Outex, MA and HE datasets | 41 I / 87 I / 100 I | 89.9 / 90.0 / 86.8
Zhang et al. (2015b) | O1, O2, O3 | Color moments | ANN | 6 | DTMR | 600 I | 79.0
Zhang et al. (2016a) | RGB, Lab | Color moments | SCSM | 6 / 8 | DTMR | 50 I / 715 I | 77.4 / 68.8

Note: '—' means not available. Abbreviations: DTMR — Department of Transport and Main Roads, LBP — Local Binary Patterns, GLCM — Gray-Level Co-occurrence Matrix, MO — Morphological Opening, SCSM — Spatial Contextual Superpixel Model, I — Image, V — Video. 'Self-created' means the database was created by the authors of the corresponding paper. More details about the Bristol database can be found in Mackeown (1994).
The Gabor filter based texture features were combined with a monochrome and two opponent color channels in Campbell et al. (1997) to segment outdoor objects using a self-organizing feature map, and a multi-layer perceptron was further trained to classify the segmented regions into one of 11 object categories in natural scenes. In Chowdhury et al. (2015), a majority voting strategy was adopted for classifying dense and sparse grass regions based on the prediction results of three classifiers (ANN, K-Nearest Neighbors — KNN, and Support Vector Machine — SVM) and features represented by the co-occurrence of binary patterns. Pixel-level color intensity and moment features were extracted and combined to train an ANN classifier for segmenting roadside vegetation from other objects (Zhang et al., 2015b). In Zhang et al. (2016a), a Spatial Contextual Superpixel Model (SCSM) was presented to segment roadside vegetation by progressively merging low-confidence superpixels into their closest neighbors, achieving promising results.

Although existing visible approaches have demonstrated promising performance, they still suffer from three drawbacks: (1) they assume the availability of a set of effective visible features; however, designing visible features that can robustly represent the characteristics of vegetation under different environmental conditions is still a challenge. (2) The performance of supervised learning classifiers is heavily dependent on the effectiveness of the visible features and the correct design of the learning process. (3) They are incapable of taking into consideration local properties of vegetation in the test data, which may have substantially different characteristics from the training data. In addition, it is also noted that the majority of current visible approaches have mainly focused on the discrimination between vegetation and non-vegetation, and their performance may deteriorate substantially when a large number of objects coexist in the same scene data.
2.2. Invisible vegetation segmentation approach

Invisible approaches attempt to utilize the reflectance characteristics of chlorophyll-rich vegetation in the invisible spectrum to distinguish it from other objects. The theory behind the use of invisible features is that vegetation depends on chlorophyll to convert sunlight radiant energy into organic energy, and this process shows certain wavelength characteristics. Based on this theory, different types of VIs were proposed to emphasize different spectral properties of vegetation and other objects in particular wavelength bands (e.g., near infrared). It is worth mentioning that VIs are not restricted to the invisible spectrum; they can also be extracted in the visible spectrum, such as the excess green index. Table 2 shows several typical existing invisible approaches to vegetation segmentation.

In Bradley et al. (2007), pixel-wise comparisons between Red and Near-InfraRed (NIR) reflectance were shown to provide a robust method of segmenting photosynthetic vegetation. To overcome illumination effects in natural environments, the NIR was extended to different versions such as the Normalized Difference Vegetation Index (NDVI) (Bradley et al., 2007), the Modification of NDVI (MNDVI) (Nguyen et al., 2012b), and a combination of NDVI and MNDVI (Nguyen et al., 2012c). The NDVI is based on a simple assumption of a linear hyperplane for vegetation segmentation. Nguyen et al. (2012b) experimentally validated that the hyperplane is better represented logarithmically rather than linearly. Based on this, they proposed the MNDVI and confirmed its better robustness against various illumination effects and higher accuracy for vegetation segmentation than the NDVI. However, the main drawback of the MNDVI is that it is still impacted by the softened red reflectance when it is used in under-exposure or dim lighting conditions. The NDVI, by contrast, is observed to perform well in those circumstances, and thus Nguyen et al. (2012c) adopted a combination of the two to achieve more robust performance.

There are also many studies that have used remote sensing techniques to segment and recognize objects, such as hyperspectral imaging (Lu et al., 2014; Yuan et al., 2015), LIght Detection And Ranging (LIDAR) (Zhang and Grift, 2012; Andújar et al., 2016), Synthetic Aperture Radar (SAR) (Santi et al., 2017), ultrasonic sensors (Moeckel et al., 2017; Chang et al., 2017) and mobile laser scanners (Li et al., 2016). The features extracted from different types of remote sensors have also been fused to provide better segmentation results, such as the fusion of VIs and terrestrial laser data (Tilly et al., 2015), and the fusion of ultrasonic and spectral sensor data (Moeckel et al., 2017). For recent surveys of these studies, readers can refer to Lu et al. (2016) and Galidaki et al. (2017).
Table 2
A list of existing invisible approaches to vegetation segmentation.

Ref. | Feature | Predictor | Class | Database | Data size | Acc. (%)
Bradley et al. (2007) | Density, RGB, NIR, and NDVI | Multi-class logistic regression | Veg, obstacle, ground | Self-created | 2 physical environments | 95.1
Nguyen et al. (2012b) | MNDVI | Threshold | Veg vs. non-Veg | Self-created | 5000 I, 20 V | 91
Nguyen et al. (2012c) | MNDVI, NDVI, background subtraction, dense optical flow | Fusion | Passable Veg vs. others | Self-created | 1000 I | 98.4
Wurm et al. (2014) | Laser reflectivity, measured distance, incidence angle | SVM | Flat Veg vs. drivable surfaces | Self-created | 36,304 Veg, 28,883 street | 99.9
Kang et al. (2011) | Filter banks in Lab and infrared | Joint boost + CRF | 8 objects (road, sky, tree, car, etc.) | Self-created | 2 V | 87.3
Nguyen et al. (2011) | Ladar scatter features, intensity mean & std., scatter, surface, histogram | SVM | Veg vs. non-Veg | Self-created | 500 I | 81.5

Abbreviations: CRF — Conditional Random Field, Veg — Vegetation, non-Veg — non-Vegetation, I — Image, V — Video. 'Self-created' means the database was created by the authors of the corresponding paper.
Compared with visible approaches, most existing invisible approaches emphasize the use of VIs that are created based on the reflectance characteristics of vegetation. The VIs have the advantage of containing a rich set of features that cannot be captured from the visible spectrum and are often more robust against environmental effects. Nevertheless, VIs are largely designed based on a set of sample data of vegetation and thus they cannot reflect local properties of vegetation in new data, which may substantially differ from those of vegetation in the training data. In addition, VIs also suffer from low reliability in characterizing a large number of different types of vegetation in natural conditions. In general, capturing invisible features often requires specialized equipment, such as LIDAR and laser scanners, which greatly increases the cost and difficulty of a direct deployment in ground-based applications.

2.3. Hybrid approach

To utilize the merits of both visible and invisible features, hybrid approaches combine them to achieve more robust segmentation results. In Nguyen et al. (2011), 3D scatter features were designed to represent the structure of vegetation in a spatial neighborhood of Ladar data, and they were further used in conjunction with histograms of HSV color channels to segment vegetation for outdoor automobile guidance. In Kang et al. (2011), 20-D filter banks were used to extract texture features from four channels (L, a, b and infrared) in a system for object segmentation from road scenes. From the extracted features, multi-scale texton features were generated from larger neighboring windows to segment eight objects from real-world road video scenes, achieving a global accuracy of 87.3%. In Nguyen et al. (2012a), the NDVI and MNDVI were combined to determine the initial seed pixels from chlorophyll-rich vegetation regions. Vegetation pixel spreading was then performed based on the opponent color channels and Gabor features. For hybrid approaches, how to effectively combine visible and invisible features to achieve more robust performance on natural data is still a research topic that needs further investigation.

To summarize, state-of-the-art efforts on ground-based vegetation segmentation are generally based on three categories of methods: (1) supervised prediction algorithms using visible features extracted from the training data, (2) VIs designed based on invisible characteristics of a sample vegetation dataset, and (3) pixel merging strategies performed based on the visual similarity of adjacent pixels. The supervised algorithms heavily depend on the competency of the training data in covering the most discriminative features of all test objects, which is unlikely to be the case in real-world scenarios. Similarly, pre-defined VIs have been designed based on spectral characteristics of objects in a training dataset, and a threshold value needs to be set properly to ensure the accuracy of the results. The VIs have limited capacity for accommodating possibly big variations of the spectral characteristics in the test data. Thus, both supervised prediction algorithms and VIs have difficulty capturing local characteristics in the test data, and they often have a low tolerance against various variations of features in new scene data. By contrast, pixel merging is performed in an unsupervised way and purely based on local characteristics of neighboring pixels in local spatial neighborhoods, but it does not consider useful generic features of vegetation that can possibly be extracted from the training data.

To address the above drawbacks, the proposed ATCM integrates supervised learning and unsupervised clustering towards adaptive vegetation segmentation. It is based on the assumption that both generic and local characteristics are important for accurate object segmentation. The ATCM is an extension of our previous work (Zhang et al., 2015a) by additionally taking into account local properties in the test data, and of work (Zhang et al., 2016b) by including (a) a more comprehensive review of related work, (b) more explanations and technical details about the processing steps of the proposed ATCM approach, (c) more experimental results on the DTMR dataset and two new datasets — the Croatia roadside grass dataset and the Stanford background dataset; and (d) more performance analysis based on the experimental results. The concept of the ATCM was inspired by recent efforts on utilizing local contextual information (e.g., local consistency of pixel labels (Fulkerson et al., 2009) and spatial dependency between region labels (Singhal et al., 2003; MyungJin et al., 2012)) to improve the performance of object labeling in complicated real-world scenes. However, a problem of these works is that the contextual and spatial constraints are generally based on knowledge learnt either from the training data or a specific environment, and thus their performance is difficult to generalize to new data. There are also studies that generated statistical models of newly observed data, such as mixture Gaussian models (Kumar et al., 2003) and probability density functions (Bosch et al., 2007), to refine pixel-level classification results. As collective classification over spatial regions has shown higher accuracy than pixel-level classification (Zhang et al., 2015a; Achanta et al., 2012), the ATCM utilizes superpixel-level voting to enforce a spatial constraint on the smoothness of class labels in neighboring pixels.

2.4. Main contributions

The main contributions of this paper are as follows: (a) We present a novel ATCM, which integrates supervised learning and unsupervised clustering techniques to consider both generic features of vegetation learnt from the training data and local properties in every test image. Thus, it achieves accurate vegetation segmentation results while retaining high robustness against variations in scene content and the environment. (b) We design a new texton-based clustering approach to represent local properties in every test image. The approach generates a set of class-specific textons from the training data, groups pixels in the
test image into a list of local clusters, and accumulates cluster-level texton occurrence frequencies to effectively represent local properties of different object categories in the test image. This is one of the earliest attempts to utilize local contextual information in the test data for adaptive segmentation of roadside vegetation. (c) We evaluate the proposed ATCM on two real-world roadside image datasets: one built from the ground-based video data captured by the Department of Transport and Main Roads (DTMR), Queensland, Australia, and the other being the Croatia roadside grass dataset. Additionally, we also report results on the widely used Stanford background dataset. The ATCM demonstrates very competitive performance compared to existing approaches. In addition, we make the DTMR roadside image dataset publicly accessible to researchers in this field.
3. Proposed ATCM

This section describes details of the proposed ATCM, which integrates pixel-level class probabilities predicted by ANN classifiers and cluster-level class probabilities calculated based on texton occurrence frequencies for roadside vegetation segmentation.

3.1. System framework of ATCM

Fig. 1 depicts the system framework of the proposed ATCM for roadside vegetation segmentation, with details of the processing steps in a training stage and a test stage. At the training stage, a set of roadside image regions is created for six roadside objects based on the training dataset. From the cropped regions, both color features and filter bank based texture features are extracted from a local neighborhood of each pixel to represent visual properties of objects, and are further input into a K-means clustering algorithm to build class-specific color and texture textons respectively. The textons represent intrinsic visual features of each class and are used in the test stage to reflect different local properties in every test image. In addition, for each object category, a class-specific ANN classifier is also trained based on a fused feature set comprising color and texture features, and is used to predict pixel-level class probabilities in every test image. The pixel-level class probabilities contain the discriminative information that is learnt from the training dataset.

Given a specific test image, the test stage comprises four main processing steps: (1) feature extraction, (2) predicting pixel-level class probabilities using the trained ANNs, (3) calculating cluster-level texton occurrence frequencies in the test image, and (4) performing superpixel-level voting to obtain a final class label for every test pixel. For every test pixel, color and filter bank based texture features are first extracted and fused to generate pixel-level class probabilities using the trained ANN classifiers. In parallel, all test pixels are classified into a group of color and texture clusters by K-means clustering, where each cluster is composed of a pixel subset which shares predominantly similar characteristics and is likely to belong to the same class. Pixels in each cluster are then mapped to the closest color and texture textons learnt at the training stage, and the occurrence frequencies of class-specific textons are aggregated over every local cluster to calculate the cluster-level class probabilities. To segment vegetation from other objects, the two sets of class probabilities at pixel level and cluster level are fused using a linear combination method over the image pixels in every oversegmented superpixel individually, and the class label of every superpixel is finally obtained using majority voting over all classes. The main merits of the ATCM include: (1) using pixel-level class probabilities to reflect generic characteristics of objects via supervised learning, (2) using cluster-level class probabilities to preserve different local properties of objects in every test image using unsupervised clustering, and (3) using superpixel-level voting to enforce a spatial constraint on the consistency of class labels in neighboring pixels to achieve smooth segmentation outputs.

3.2. Extraction of color and texture features

As the first processing step of the ATCM, feature extraction is designed to collect a list of the most discriminative and robust visual features of vegetation to distinguish it from other object categories (e.g., sky, road and soil). Various types of features can be potentially useful for the segmentation of vegetation from other objects, such as shape, geometry and structure. However, only two primary types of visual features are considered in the proposed ATCM approach: color and texture. Both are extracted because they often reflect different aspects of object properties in real-world scenarios, and thus a combination of them can assist in creating more effective and discriminative features of roadside objects. The features are used for the generation of color and texture textons.

Color features. Finding effective and robust color features is not an easy task and is still under active research. However, it is generally accepted that a suitable color space should keep consistency with human eye perception, because human eyes have shown good capacity for distinguishing different objects even under very extreme environmental conditions. Thus, we choose the CIELab space due to its perceptual consistency with human vision and its promising results on object segmentation in existing work (Shotton et al., 2008). The RGB space is also included, as using a single Lab space may be insufficient for classifying some similar objects, and a combination of the two spaces is anticipated to be more informative and discriminative. For an image pixel at a coordinate (x, y), we can represent its color features by:

V^c_{x,y} = \{R, G, B, L, a, b\}   (1)

where R, G, B represent the red, green and blue color components respectively in the RGB color space, while L, a, b represent the lightness, green–red, and blue–yellow color components respectively in the CIELab color space.

Texture features. A large number of texture descriptors have been proposed for extracting texture features for object segmentation. One popular way of generating texture features is using texture filters; popular examples include multi-scale and multi-orientation Gabor filters, the 48 filters of the Leung and Malik (LM) set, the 13 filters of the Schmid (S) set, and the 38 filters of the Maximum Response (MR8) set. In the proposed model, we use the 17-D filter bank originally used in Winn et al. (2005), which has demonstrated highly competitive performance on natural object segmentation. To be specific, these filter banks are composed of Gaussian filters with three scales (1, 2, 4) applied to the L, a, and b channels, Laplacian of Gaussian filters with four scales (1, 2, 4, 8), and derivatives of Gaussian filters with two scales (2, 4) for the x and y axes on the L channel. By applying these filter banks to every image pixel, we can obtain 17 response features for every pixel at a coordinate (x, y):

V^t_{x,y} = \{G^L_{1,2,4}, G^a_{1,2,4}, G^b_{1,2,4}, LOG^L_{1,2,4,8}, DOG^L_{2,4,x}, DOG^L_{2,4,y}\}   (2)

where G^L_{1,2,4}, G^a_{1,2,4}, G^b_{1,2,4} indicate the outputs of Gaussian filters with three scales (1, 2, 4) applied to the L, a, and b channels respectively, LOG^L_{1,2,4,8} indicates the outputs of Laplacian of Gaussian filters with four scales (1, 2, 4, 8) applied to the L channel, and DOG^L_{2,4,x} and DOG^L_{2,4,y} represent the outputs of derivatives of Gaussian filters with two scales (2, 4) for the x and y axes respectively on the L channel.
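To make the feature definitions concrete, the following is a minimal sketch (not the authors' released code) of how the per-pixel features in Eqs. (1) and (2) could be computed with scikit-image and SciPy; the function name and image conventions are illustrative assumptions.

```python
# Sketch of Eqs. (1)-(2): 6-D color features (RGB + CIELab) and the 17-D filter bank of Winn et al. (2005).
import numpy as np
from skimage import color
from scipy.ndimage import gaussian_filter, gaussian_laplace

def extract_pixel_features(rgb):
    """rgb: H x W x 3 float image in [0, 1]. Returns (H, W, 6) color and (H, W, 17) texture features."""
    lab = color.rgb2lab(rgb)                            # L, a, b channels
    color_feat = np.concatenate([rgb, lab], axis=2)     # Eq. (1): {R, G, B, L, a, b}

    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    responses = []
    # Gaussians at scales 1, 2, 4 on L, a, b (9 responses)
    for chan in (L, a, b):
        for s in (1, 2, 4):
            responses.append(gaussian_filter(chan, sigma=s))
    # Laplacian of Gaussian at scales 1, 2, 4, 8 on L (4 responses)
    for s in (1, 2, 4, 8):
        responses.append(gaussian_laplace(L, sigma=s))
    # First derivatives of Gaussian at scales 2, 4 along x and y on L (4 responses)
    for s in (2, 4):
        responses.append(gaussian_filter(L, sigma=s, order=(0, 1)))   # d/dx
        responses.append(gaussian_filter(L, sigma=s, order=(1, 0)))   # d/dy
    texture_feat = np.stack(responses, axis=2)          # Eq. (2): 17-D response per pixel
    return color_feat, texture_feat
```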
Fig. 1. Framework of the proposed Adaptive Texton Clustering Model (ATCM) comprising a training and a test stage. During training, class-specific color and texture textons are established for each object from a set of carefully cropped image regions. For tests on a new image, cluster-level class probabilities are calculated based on texton occurrence frequencies in each cluster, while pixel-level probabilities are obtained using supervised ANN classifiers. The two class probabilities at pixel-level and cluster-level are then combined over pixels in every image superpixel, and finally pixels within every superpixel are classified into the class which has the largest probability amongst all class categories.
3.3. Supervised prediction of pixel-level class probability

The objective of pixel-level supervised prediction is to obtain the probability of a test pixel belonging to each class category using a supervised learning algorithm. The pixel-level probability can then be combined with cluster-level probabilities and further aggregated via superpixel-level voting to obtain the class label of every test pixel. Towards this aim, we present a supervised learning process to learn generic properties of all objects from the training data. Rather than generating a single multi-class classifier that obtains probabilities for all objects in a single prediction, we adopt the alternative of training a class-specific classifier for every object. Class-specific classifiers focus specifically on mapping the discriminative features of one class to its class label, while mapping the features and their variations in all the remaining classes to a second class label. Thus, they are expected to be more robust against different levels of feature variation between classes and eventually improve the prediction accuracy. Class-specific classifiers generate a map of class probabilities for all test pixels and for each class separately, and they are more suitable for adaptive segmentation of objects in natural scenes (Bosch et al., 2007). In the proposed ATCM, a class-specific binary ANN classifier is trained individually for every class. However, it is worth noting that the prediction can also be achieved by other algorithms, such as SVM and random forests.

Let C_i be the ith class (i = 1, 2, ..., M) and M the total number of classes. For a pixel p_{x,y}, its probability of belonging to C_i is predicted using the ith ANN classifier:

p^i_{x,y} = tran(w_i V_{x,y} + b_i)   (3)

where V_{x,y} is the extracted feature vector of the pixel p_{x,y}, and tran indicates a three-layer feedforward ANN classifier with a 'tan-sigmoid' activation function, trainable weights w_i and constant parameters b_i. A key parameter of the ANN is the number of hidden layers, which is likely to directly impact the segmentation results. In this paper, we conduct experimental comparisons to investigate its impact and determine its value. Every class-specific binary classifier produces a probability map for its corresponding class, and for all classes there are a total of M maps. In the predicted results of the ANN classifiers, M class probabilities are obtained for an image pixel p_{x,y}:

PM_{x,y} = \{p^1_{x,y}, p^2_{x,y}, \ldots, p^M_{x,y}\}   (4)
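As a rough illustration of the class-specific prediction in Eqs. (3)-(4), the sketch below trains one binary (one-vs-rest) classifier per class and stacks the resulting probability maps. It uses scikit-learn's MLPClassifier with a tanh activation as a stand-in for the paper's tan-sigmoid ANNs; the hidden-layer size and all names are assumptions, not values from the paper.

```python
# Sketch of Section 3.3: one binary ANN per class, producing M pixel-level probability maps.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_class_specific_anns(train_feats, train_labels, n_classes, n_hidden=30):
    """train_feats: (n_pixels, 23) fused color+texture features; train_labels: class ids 0..M-1."""
    models = []
    for i in range(n_classes):
        ann = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation='tanh', max_iter=500)
        ann.fit(train_feats, (train_labels == i).astype(int))   # one-vs-rest binary target
        models.append(ann)
    return models

def predict_pixel_probabilities(models, test_feats):
    """Return an (n_pixels, M) matrix of pixel-level class probabilities, as in Eq. (4)."""
    probs = [m.predict_proba(test_feats)[:, 1] for m in models]   # P(pixel belongs to class i)
    return np.stack(probs, axis=1)
```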
Fig. 2. Graphical illustration of the way of generating cluster-level class probabilities in a test image. The test pixels are first grouped into a certain number of clusters to represent local properties in the image, and the color and texture features are mapped separately into the closest texton learnt from the training stage. The texton occurrence frequencies across all classes are then calculated in each cluster and further converted into cluster-level class probabilities.

3.4. Cluster-level class probability calculation

This part introduces the main contribution of the proposed ATCM, which calculates cluster-level class probabilities to represent local properties of objects in a new test image. The cluster-level probabilities can then be further integrated with the pixel-level probabilities produced by the ANNs in Section 3.3. For this aim, the ATCM employs a texton based adaptive clustering algorithm to categorize test pixels into a list of color and texture clusters, and then calculates texton occurrence frequencies in each cluster, which are further converted into cluster-level class probabilities. This process is designed based on the observation that, although local properties of a test image may vary significantly from those of the training data, test pixels with predominantly similar characteristics are more likely to belong to the same category, and thus these pixels can be grouped into the same cluster to derive robust cluster-specific features over a pool of test pixels from the cluster. The cluster-specific features also help achieve a certain degree of robustness to environmental variations; for instance, an even change of lighting across the whole scene exerts little impact on the clustering results. The clusters represent similar characteristics of different groups of pixels in every test image, and the corresponding cluster-level texton occurrence frequencies are thus used for assisting vegetation segmentation. Therefore, a key feature of the proposed approach is that the local information from every test image is used to dynamically refine the approach to further improve the segmentation results. For a new test image, it is ensured that its scene content information is utilized to assist the decision-making process at the segmentation stage. The whole algorithm is composed of the following processing steps.

(a) Generating color and texture textons. Textons, which are essentially clustered centers of feature descriptors such as filter bank responses, can be used to build an effective feature representation of the visual appearance of objects. In most existing studies (Blas et al., 2008; Winn et al., 2005), a set of generic textons is generated for all objects based on the training data, and the extracted visual features can then be mapped to the closest texton, resulting in a histogram-based feature vector representing the properties of the whole image. Obviously, generic textons often have limited capacity for capturing specific characteristics of all classes and suffer from the difficulty of effectively handling confusion between object categories, particularly those with very similar visual characteristics in real-world situations, such as tree leaves and grasses. To address this issue, we adopt the strategy of creating a unique dictionary of class-semantic color and texture textons separately for each class, and we anticipate that these are able to more effectively represent the intrinsic class-specific features of each class category. As a result, the confusion between objects can be better handled by using class-specific textons compared with generic textons, leading to improved classification results.

Assume we have a total of M class categories, and for every pixel at (x, y) in a training image, its color and texture features are represented by V^c_{x,y} and V^t_{x,y} respectively. For all training pixels of the ith class C_i (i = 1, 2, ..., M), we can obtain a total of K color textons and K texture textons using the popular K-means clustering by minimizing Eqs. (5) and (6) respectively:

E_c = \sum_{x,y \in C_i} \min_k |V^c_{x,y} - T^c_{i,k}|^2   (5)

E_t = \sum_{x,y \in C_i} \min_k |V^t_{x,y} - T^t_{i,k}|^2   (6)

where T^c_{i,k} and T^t_{i,k} indicate the kth color and kth texture textons respectively generated for the ith class C_i (k = 1, 2, ..., K), and E_c and E_t represent the target error functions. The minimization in Eqs. (5) and (6) is performed separately, yielding independent sets of color and texture textons. They are anticipated to reflect different characteristics of an object category and can be used complementarily to produce a more robust feature vector.

For the ith class, the learnt textons are then fused into a larger class-semantic texton vector. Note that two vectors are created separately, one for color features and the other for texture features:

T^c_i = \{T^c_{i,1}, T^c_{i,2}, \ldots, T^c_{i,K}\} \quad \text{and} \quad T^t_i = \{T^t_{i,1}, T^t_{i,2}, \ldots, T^t_{i,K}\}   (7)

By combining the class-semantic texton vectors for all M classes, we can get two texton matrices, for color and texture features respectively:

T^c = \begin{Bmatrix} T^c_{1,1}, T^c_{1,2}, \ldots, T^c_{1,K} \\ T^c_{2,1}, T^c_{2,2}, \ldots, T^c_{2,K} \\ \vdots \\ T^c_{M,1}, T^c_{M,2}, \ldots, T^c_{M,K} \end{Bmatrix} \quad \text{and} \quad T^t = \begin{Bmatrix} T^t_{1,1}, T^t_{1,2}, \ldots, T^t_{1,K} \\ T^t_{2,1}, T^t_{2,2}, \ldots, T^t_{2,K} \\ \vdots \\ T^t_{M,1}, T^t_{M,2}, \ldots, T^t_{M,K} \end{Bmatrix}   (8)

Until now, two texton matrices have been generated, which represent the class-semantic textons of all classes learnt from the training data. In other words, they are supposed to reflect the most discriminative generic features of each object. However, it is noted that the effectiveness of the textons is largely dependent on the quality of the training data in representing discriminative properties between objects, and thus a careful development of the training data is a critical step, which will be explained in the experiment section.
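A possible realization of step (a), assuming the per-class pixel features have already been collected, is sketched below: the class-semantic textons are simply the K-means centers computed separately for each class (Eqs. (5)-(8)). The value of K and the array layout are illustrative assumptions.

```python
# Sketch of step (a): building class-semantic color (or texture) textons with K-means.
import numpy as np
from sklearn.cluster import KMeans

def build_class_textons(feats_per_class, K=30):
    """feats_per_class: list of (n_i, d) arrays, one per class, of pixel features (color or texture).
    Returns an (M, K, d) array of class-semantic textons, i.e. the matrices in Eq. (8)."""
    textons = []
    for class_feats in feats_per_class:
        km = KMeans(n_clusters=K, n_init=10).fit(class_feats)   # minimizes Eq. (5)/(6) for this class
        textons.append(km.cluster_centers_)                     # the K textons T_{i,1..K}
    return np.stack(textons, axis=0)

# Usage: color_textons = build_class_textons(color_feats_by_class); likewise for texture features.
```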
(b) Texton mapping and cluster-level texton occurrence frequencies. Once the class-semantic textons are generated, they can be used as the basic feature unit to build higher-level features for every test image to reflect its local properties. Incorporating local properties of every test image has the benefit of creating more representative and discriminative features that are adaptive to both the local characteristics of the object categories present in the test image and the holistic characteristics of the whole image, such as illumination conditions, leading to more robust segmentation results. Towards this aim, we first group all pixels in the test image into a set of local clusters, which reflect different aspects of visual characteristics in the image. All pixels in every cluster are then mapped to one of the learnt color or texture textons, which is accomplished by comparing their distances to all textons in the feature space and finding the texton with the smallest distance. Because each texton belongs to only one class, this process effectively classifies all test pixels into one of the M classes. The occurrence frequencies of the textons over all pixels are then calculated for every cluster, resulting in a set of cluster-level texton occurrence frequencies. The occurrence frequencies are further converted to class probabilities within a range of [0, 1]. The whole process is illustrated in Fig. 2, with detailed steps described as follows.

The local characteristics of objects in every test image may vary substantially in terms of spatial layout, structure, appearance, context, etc. To capture the variations, we perform K-means clustering separately on the color and texture features V^c_{x,y} and V^t_{x,y} of the pixels in a test image I to categorize them into two sets of local clusters:

C^c = \{C^c_1, \ldots, C^c_q, \ldots, C^c_Q\} \quad \text{and} \quad C^t = \{C^t_1, \ldots, C^t_p, \ldots, C^t_Q\}   (9)

where C^c_q and C^t_p indicate the qth color and pth texture clusters respectively (q = 1, 2, ..., Q; p = 1, 2, ..., Q), and Q indicates the number of color or texture clusters. For simplicity, the same Q is used for color and texture features in the proposed ATCM. Different from the generation of textons, where the K-means clustering is performed on features from all training data, the cluster grouping is completely based on pixel features in a test image, and thus each cluster represents local visual characteristics of a test pixel subset. Note that geometric position information is completely discarded in the cluster grouping, as the clustering process is based only on the feature similarity between pixels, which can be at any location of the test image, as shown in Fig. 3.

Fig. 3. Graphical illustration of grouping all test pixels into a set of clusters (indicated by integer numbers) which represent different local properties (indicated by different colors) in the test image (left). The positions of these pixels are discarded after the clustering process (right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Suppose a total of m test pixels are mapped into the qth color cluster C^c_q. For each of the m pixels in C^c_q, its color and texture features are obtained using Eqs. (1) and (2) respectively. Given the two texton matrices generated using Eq. (8), each of the m pixels in C^c_q can be mapped to the closest color texton and the closest texture texton based on Eqs. (10) and (11) respectively:

\varphi(V^c_{x,y}, T^c_{i,k}) = \begin{cases} 1, & \text{if } \|V^c_{x,y} - T^c_{i,k}\| = \min_{i=1,2,\ldots,M;\, k=1,2,\ldots,K} \|V^c_{x,y} - T^c_{i,k}\| \\ 0, & \text{otherwise} \end{cases}   (10)

\varphi(V^t_{x,y}, T^t_{i,k}) = \begin{cases} 1, & \text{if } \|V^t_{x,y} - T^t_{i,k}\| = \min_{i=1,2,\ldots,M;\, k=1,2,\ldots,K} \|V^t_{x,y} - T^t_{i,k}\| \\ 0, & \text{otherwise} \end{cases}   (11)

where '1' means the current texton is the closest texton and '0' means it is not. The Euclidean distance is employed as the similarity measurement between a test pixel and a learnt texton, for both color and texture features.

By aggregating the mapped textons of every pixel in C^c_q, we can obtain a color texton occurrence matrix, which indicates the frequency with which the kth color texton T^c_{i,k} is mapped for all pixels belonging to the ith class:

O^c_{i,k}(C^c_q) = \sum_{x,y \in C^c_q} \varphi(V^c_{x,y}, T^c_{i,k})   (12)

Then, we can accumulate all elements (i.e., occurrence frequencies) in the color texton occurrence matrix created for the ith class using Eq. (12), resulting in the overall color texton occurrence frequency of all pixels within C^c_q for the ith class:

O^c_i(C^c_q) = \sum_{k=1}^{K} O^c_{i,k}(C^c_q)   (13)

By repeating the above process for texture features, a texture texton occurrence matrix can be obtained as well; it indicates the frequency with which the kth texture texton T^t_{i,k} is mapped for all pixels belonging to the ith class:

O^t_{i,k}(C^c_q) = \sum_{x,y \in C^c_q} \varphi(V^t_{x,y}, T^t_{i,k})   (14)

The total occurrence frequency of texture textons of all pixels within C^c_q for the ith class can also be calculated:

O^t_i(C^c_q) = \sum_{k=1}^{K} O^t_{i,k}(C^c_q)   (15)

To fully utilize both color and texture textons for object segmentation, we further combine their occurrence frequencies for the ith class:

O^{c,q}_i = a \cdot O^c_i(C^c_q) + O^t_i(C^c_q)   (16)

where a stands for the weight given to color textons relative to a weight of '1' for texture textons; the value of a is set to 1 based on our previous finding that color and texture textons play a very similar role in roadside object segmentation (Zhang and Verma, 2017). The combined occurrence frequency provides an indication of the likelihood that the qth color cluster C^c_q belongs to each of the M classes, inferred from both color and texture features. By integrating the combined occurrence frequency of C^c_q for all classes, we can get an occurrence frequency vector for C^c_q, which reflects local properties of objects in the test image:

O^{c,q} = \{O^{c,q}_1, O^{c,q}_2, \ldots, O^{c,q}_M\}   (17)

The pixel-level class probabilities obtained in Section 3.3 are within a range of [0, 1]. To enable a direct combination of the texton occurrence frequencies with the pixel-level class probabilities, the proposed approach converts the occurrence frequency vector to a probability-based vector, which ranges between [0, 1] with a sum of 1, by dividing by twice the number of pixels m in C^c_q:

P^{c,q} = \{P^{c,q}_1, P^{c,q}_2, \ldots, P^{c,q}_M\} = \{O^{c,q}_1/2m, O^{c,q}_2/2m, \ldots, O^{c,q}_M/2m\}   (18)

where P^{c,q} indicates the class probability vector calculated for the qth color cluster C^c_q. In a similar way, the class probability vector P^{t,p} for the pth texture cluster C^t_p can also be calculated:

P^{t,p} = \{P^{t,p}_1, P^{t,p}_2, \ldots, P^{t,p}_M\} = \{O^{t,p}_1/2n, O^{t,p}_2/2n, \ldots, O^{t,p}_M/2n\}   (19)

where n indicates the number of pixels belonging to cluster C^t_p, and O^{t,p}_i indicates the combined occurrence frequencies of color and texture textons for pixels belonging to C^t_p and for the ith class (similar to C^c_q in Eq. (18)). The P^{c,q} and P^{t,p} obtained in Eqs. (18) and (19) are based on the color and texture clusters respectively, and they indicate the likelihoods of every pixel in the clusters C^c_q and C^t_p belonging to each of the M classes. They are calculated solely based on the distances of visual features between pixels in the test image; therefore, they represent different aspects of local characteristics of objects in the test image and their values normally vary for every test image.
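The following sketch illustrates the color-cluster branch of step (b) under stated assumptions: test pixels are grouped with K-means (Eq. (9)), each pixel is assigned to the class of its nearest color and texture texton (Eqs. (10)-(11)), and the per-cluster counts are combined with a = 1 and normalized into class probabilities (Eqs. (16)-(18)). The texture-cluster branch (Eq. (19)) is analogous; Q and all names are hypothetical.

```python
# Sketch of step (b): cluster-level texton occurrence frequencies converted to class probabilities.
import numpy as np
from sklearn.cluster import KMeans

def cluster_class_probabilities(pixel_feats_c, pixel_feats_t, color_textons, texture_textons, Q=10):
    """pixel_feats_c: (n, 6) color features; pixel_feats_t: (n, 17) texture features of one test image.
    color_textons / texture_textons: (M, K, d) class-semantic textons. Returns (n, M) cluster-level probs."""
    M = color_textons.shape[0]
    labels = KMeans(n_clusters=Q, n_init=10).fit_predict(pixel_feats_c)   # Eq. (9): color clusters

    def nearest_class(feats, textons):
        # Map each pixel to the class owning its nearest texton (Eqs. (10)-(11)).
        flat = textons.reshape(-1, textons.shape[-1])                     # (M*K, d), class-major order
        dists = np.linalg.norm(feats[:, None, :] - flat[None, :, :], axis=2)
        return np.argmin(dists, axis=1) // textons.shape[1]               # texton index -> class index
    cls_c = nearest_class(pixel_feats_c, color_textons)
    cls_t = nearest_class(pixel_feats_t, texture_textons)

    probs = np.zeros((len(labels), M))
    for q in range(Q):                                                    # per-cluster accumulation
        idx = np.where(labels == q)[0]
        counts = np.bincount(cls_c[idx], minlength=M) + np.bincount(cls_t[idx], minlength=M)  # Eq. (16), a = 1
        probs[idx] = counts / (2.0 * len(idx))                            # Eq. (18): normalize by 2m
    return probs
```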
It should be noted that although the clusters are generated independently from every test image, the texton mapping and the calculation of the texton occurrence frequencies are still dependent on the training data. As a result, the cluster-level class probabilities are impacted by the effectiveness of the training data. In other words, the training data still plays an important role in the extraction of cluster-level information. However, by incorporating the cluster-level local information in every test image, the cluster-level class probabilities can provide complementary information to assist the object segmentation decisions on top of the general information learnt from the training data. Therefore, it is expected that more robust segmentation results can be achieved, particularly for situations where test images have big variations in object category, layout, background, environment, etc.

3.5. Superpixel-level voting and segmentation

The objective of object segmentation is to obtain a correct class label for each test image pixel. At this stage, we have calculated the pixel-level and cluster-level class probabilities for every test pixel, which indicate a pixel's likelihoods of belonging to all classes, predicted using supervised ANN classifiers based on generic features from the training data and using unsupervised K-means clustering based on local properties in the test image, respectively. They are further aggregated by a superpixel-level majority voting strategy to acquire a final class label for every superpixel in the test image. The superpixel-level voting strategy essentially enforces a spatial constraint on the smoothness of class labels of pixels within a local neighborhood, which produces robust segmentation by making a decision based on collective votes over a set of pixels. Since the calculation of pixel-level and cluster-level class probabilities has not considered the geometric positions of pixels in images, the adoption of superpixel-level voting partially addresses this issue by collectively smoothing class labels of geometrically close pixels. In addition, the use of superpixel-level voting simplifies the segmentation problem from a large number (e.g., millions) of pixels to a small number (e.g., hundreds) of superpixels.

For a test image I, we first segment all its pixels into a group of homogeneous superpixels using a fast graph segmentation approach (Felzenszwalb and Huttenlocher, 2004):

S = \{S_1, S_2, \ldots, S_L\}   (20)

where S_l indicates the lth superpixel and L indicates the number of superpixels in I. Assume p_j is the jth pixel in a superpixel S_l (j = 1, 2, ..., N), where N is the number of pixels in S_l, and p_j is classified into the qth color cluster C^c_q and the pth texture cluster C^t_p. We can obtain the class probabilities P^{c,q}_j in C^c_q and P^{t,p}_j in C^t_p for p_j using Eqs. (18) and (19) respectively. It is likely that color and texture clusters may have different contributions to the segmentation decisions; however, in this paper we treat them as equally important and fuse P^{c,q}_j and P^{t,p}_j to obtain a combined probability vector P^u_j using Eq. (21):

P^u_j = \{P^u_{1,j}, P^u_{2,j}, \ldots, P^u_{M,j}\}   (21)

To distinguish from the pixel-level probabilities predicted by the ANN in Eq. (3), we use u to indicate that P^u_j is a cluster-based probability vector and s to indicate that P^s_j is a pixel-level probability vector for the jth pixel. Based on Eqs. (18) and (19), each element in Eq. (21) can be obtained using Eq. (22):

P^u_{i,j} = P^{c,q}_{i,j} + P^{t,p}_{i,j} = O^{c,q}_{i,j}/2m + O^{t,p}_{i,j}/2n   (22)

where m and n indicate the total number of pixels within C^c_q and C^t_p respectively, and P^u_{i,j} indicates the probability of p_j belonging to the ith class based on both color and texture clusters, 1 ≤ i ≤ M. During the fusion, the same weight of '1' is given to the class probabilities of both color and texture clusters.

The resulting probability P^u_j reflects the local properties of every test image, and it is further combined with the pixel-level probability P^s_j, which reflects the generic features learnt from the training data:

p^{S_l}_{i,j} = b \cdot P^u_{i,j} + P^s_{i,j}   (23)

P^{S_l}_j = \{p^{S_l}_{1,j}, p^{S_l}_{2,j}, \ldots, p^{S_l}_{M,j}\}   (24)

where the pixel-level probability P^s_{i,j} can be calculated using Eq. (3), and b indicates the weight given to the cluster-level class probability. The value of b is set to 0.8 in this paper based on experimental comparisons. The resulting p^{S_l}_{i,j} includes support information for the classification decision of pixel p_j that is learnt from both the training data and the test image.

Given the class probability p^{S_l}_{i,j} of every pixel p_j within the lth superpixel S_l in a test image, we can further obtain an overall probability vector by summing the class probabilities of all pixels in S_l:

P^{S_l} = \{p^{S_l}_1, p^{S_l}_2, \ldots, p^{S_l}_M\}   (25)

p^{S_l}_i = \sum_{j=1}^{N} p^{S_l}_{i,j}   (26)

The class label for pixels within S_l can be finally obtained by finding the class with the largest probability among all classes:

S_l \in z\text{th class if } p^{S_l}_z = \max_{i=1,2,\ldots,M} p^{S_l}_i   (27)

It is noted that the above algorithm performs a collective decision for each superpixel in the test image via a majority voting strategy. The occurrence frequencies of color and texture textons of all pixels within every superpixel are effectively combined with the results of supervised prediction to make a final decision that considers collective support information in a contextual neighborhood of each pixel. Thus, we anticipate that the segmentation result has a certain level of robustness against small noise and a high level of consistency in the class labels of neighboring pixels.
3.5. Superpixel-level voting and segmentation
𝑆 = {𝑆1 , 𝑆2 , … , 𝑆𝐿 }
𝑆
𝑙 𝑃𝑗 = {𝑝1,𝑗𝑙 , 𝑝2,𝑗𝑙 , … , 𝑝𝑀,𝑗 }
(22)
where, m and n indicate the total number of pixels within 𝐶𝑞𝑐 and 𝐶𝑝𝑡 , 𝑢 indicates the probability of 𝑝 belonging to the 𝑖th respectively, and 𝑃𝑖,𝑗 𝑗 class based on both color and texture clusters and 1< 𝑖 < 𝑀. During the fusion, the same weight of ‘1’ is given to class probabilities of both color and texture clusters. 167
Fig. 4. Samples of image regions manually cropped for seven roadside objects from the DTMR roadside image dataset. For illustration purposes, the cropped regions are shown at the same size, but their actual resolution and shape may vary significantly in the dataset.
Fig. 5. Sample test images and their pixel-level ground truths from the DTMR roadside image dataset.
(b) The test part includes 50 images manually selected from both video data from the DTMR and image data captured during our roadside field surveys. The inclusion of images from our field surveys increases the diversity of scene content in the test data, since these images were captured using a different camera and under different environmental settings from the DTMR data. We tried to select images that are representative of real-world situations, covering common types of vegetation and other objects as well as various environmental conditions, as shown in Fig. 5. Pixel-wise class labels are also provided by manually annotating all pixels into seven categories of objects: brown and green grasses, road, tree, soil, sky, and void (which indicates unknown or uncertain objects). The pixel-wise labels serve as ground truths and are used in the calculation of segmentation accuracy. During the manual annotation of ground-truth class labels, one big challenge is that it is often difficult to precisely label every pixel in small regions of tree stems. To handle this challenge, we combine regions of tree leaf and stem into the same category of tree.

(2) The Croatia roadside grass dataset (Harbas and Subasic, 2014a). The dataset comprises 270 images that were captured along public city streets using a right-view camera. The main focus is on roadside green grass and road, as displayed in Fig. 6. The images also contain other objects commonly present in city streets, such as trees, vehicles, pedestrians, buildings, and advertisement boards, as well as different real-world environmental challenges such as shadows of objects and different lighting conditions. The images are 1920×1080
pixels in resolution and provided with pixel-wise ground truths of grass versus non-grass.
4.2. Implementation details and evaluation metrics

The images in both datasets are resized to 320×240 pixels to speed up processing and to facilitate superpixel segmentation. For the graph-based superpixel segmentation approach, we follow the parameters recommended in Chang et al. (2012), i.e. σ = 0.5, k = 80, and min = 80 for an image resolution of 320×240 pixels. To obtain an equal amount of training data for every class when training the class-specific ANN classifiers, we randomly select 120 pixels from each of the cropped regions (except for sky and tree), and cropped regions of tree leaf and stem are treated as the same category of tree. The ANN has a structure of 23-H-1 neurons in three layers, with the number of hidden neurons H determined via experimental comparisons. The popular Levenberg–Marquardt backpropagation algorithm is adopted for training the ANN with the following parameters: a goal error of 0.001, a maximum of 100 epochs, and a learning rate of 0.01. The values of these parameters are determined based on our previous results (Verma et al., 2017; Zhang and Verma, 2017) on the DTMR roadside image dataset. A 7×7 Gaussian filter size is used for generating texture features based on our previous experiments (Zhang et al., 2015a). The Euclidean distance is adopted as the similarity measure both when generating color and texture textons from the training data and when grouping test pixels into local color and texture clusters using K-means clustering.
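The preprocessing above can be reproduced with off-the-shelf tools. The paper's implementation is in Matlab; the snippet below is only a rough Python stand-in using scikit-image, where the felzenszwalb arguments scale and min_size are assumed to play the roles of the paper's k and min parameters, and the file name is a placeholder.

```python
from skimage.io import imread
from skimage.transform import resize
from skimage.segmentation import felzenszwalb

# Load a frame and resize it to 320x240 as described in Section 4.2.
image = imread("roadside_frame.png")          # placeholder file name
image = resize(image, (240, 320), anti_aliasing=True)

# Graph-based superpixels (Felzenszwalb and Huttenlocher, 2004); sigma matches
# the paper, while scale/min_size are taken as the counterparts of k and min.
superpixels = felzenszwalb(image, scale=80, sigma=0.5, min_size=80)
print(superpixels.max() + 1, "superpixels")   # typically a few hundred regions
```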
The performance of the ATCM is validated using two metrics: global accuracy, which measures the classification of pixels over all test data and all classes, and class accuracy, which measures the classification of pixels within each class. Both are calculated by pixel-level comparisons between the automatically predicted class labels and the manually annotated ground truths. The global accuracy puts more emphasis on frequently occurring classes and less on infrequent ones, whereas the class accuracy gives equal emphasis to every class and does not take pixel frequencies into account; using both therefore reflects the performance more comprehensively. For the DTMR roadside image dataset, the performance is reported on the 50 test images, while on the Croatia grass dataset we run 10 random cross-validations; in each validation, 90% of the images are randomly assigned to the training data and the remaining 10% to the test data. The accuracy averaged over the 10 validations is taken as the final result.
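As a concrete reference for these two metrics, the sketch below computes the global accuracy, the per-class accuracies and a row-normalised confusion matrix from a pair of pixel-wise label maps. The function and variable names are ours, and it assumes that void pixels carry a dedicated label and are excluded from scoring.

```python
import numpy as np

def segmentation_scores(pred, truth, num_classes, void_label=-1):
    """Global accuracy, per-class accuracy and a row-normalised confusion
    matrix (rows = ground truth, columns = prediction) from label maps."""
    pred, truth = np.asarray(pred).ravel(), np.asarray(truth).ravel()
    valid = truth != void_label                      # drop void/unknown pixels
    pred, truth = pred[valid].astype(int), truth[valid].astype(int)

    conf = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(conf, (truth, pred), 1)                # count every pixel once

    global_acc = np.trace(conf) / conf.sum()                      # frequency weighted
    class_acc = np.diag(conf) / np.maximum(conf.sum(axis=1), 1)   # equal class weight
    conf_pct = 100.0 * conf / np.maximum(conf.sum(axis=1, keepdims=True), 1)
    return global_acc, class_acc, conf_pct
```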
4.3. Global accuracy vs. system parameters

This part investigates the robustness of the ATCM against different values of three key system parameters. Figs. 7, 8 and 9 show the global accuracy versus the number of color and texture textons, the number of local clusters, and the number of hidden neurons in the ANN classifiers, respectively. Since the ATCM is based on both pixel-level probabilities predicted using the ANN and cluster-level probabilities calculated from texton occurrence frequencies, we compare its performance with that of four benchmark approaches: (1) ANN based pixel-level probability (ANN_P), (2) texton based pixel-level probability (Texton_P), (3) ANN based cluster-level probability (ANN_C), and (4) texton based cluster-level probability (Texton_C). It is noted that the ATCM and the first two benchmark approaches perform pixel classification using a superpixel-level majority voting strategy, while the last two benchmark approaches are based on a cluster-level majority voting strategy.

Fig. 6. Image samples and their pixel-level ground truths from the Croatia roadside grass dataset.

Fig. 7. Impact of number of textons on global accuracy performance on the (a) DTMR roadside image dataset and (b) Croatia roadside grass dataset. System parameters: 6 color and 6 texture clusters and 16 hidden neurons (ANN).

(1) Impact of the number of textons on global accuracy. A critical parameter of the proposed ATCM is the number of color and texture textons created from the training data. It specifies the number of representative dictionary keywords to be learnt from the training data and may therefore significantly affect the calculated cluster-level class probabilities and, in turn, the accuracy of vegetation segmentation. From Fig. 7, we can see that the ATCM outperforms the four benchmark approaches for all numbers of textons tested on both datasets, although there are small fluctuations in the global accuracy. The results confirm the advantage of considering local properties of the test image in conjunction with generic characteristics learnt by supervised classifiers when making segmentation decisions. For both datasets, ANN_P and Texton_P obtain higher global accuracies (by about 4% and 1%, respectively) than ANN_C and Texton_C. This may imply that superpixel-level majority voting plays a more important role than cluster-level majority voting in making segmentation decisions. One reason is that cluster-level unsupervised learning does not consider the geometric locations of pixels, which also convey important contextual information for object segmentation because neighboring pixels often exhibit strong consistency in their class labels. The results agree with the common knowledge that pixels located within a small spatial region tend to have a higher probability of coming from the same class than pixels widely distributed within a contextual cluster, which can lie anywhere in the image. By fusing both superpixel-level and cluster-level class probabilities, the ATCM produces the highest accuracy because it considers both the spatial smoothness of class labels via superpixel-level majority voting and the local properties of the test image via the incorporated cluster-level probabilities. Among all the numbers of textons tested, 70 color-texture textons produce the highest accuracy of 76.5% on the DTMR roadside image dataset, while on the Croatia grass dataset there are only marginal increases in global accuracy when more textons are used and the accuracy stays around 94.5%. Thus, 70 color-texture textons are employed for further analysis on both datasets. The small variations in the accuracy of the ATCM also show the advantage of fusing multiple types of information in achieving higher global accuracy with only a small number of color and texture features. A slightly increasing tendency is observed when more textons are used on the Croatia grass dataset; however, further analysis reveals that the global accuracy starts to decrease substantially once the number of textons exceeds 110. One possible reason is that, when more textons are generated, each texton is forced to represent more local features of an image, which may impair its capacity to represent general features of objects in unconstrained real-world conditions.
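The texton dictionary itself is learnt from the training data as described earlier in the paper. The sketch below is only a rough illustration of the idea behind this parameter, assuming class-wise K-means clustering of pixel feature vectors with Euclidean distance; the per-class texton budget, feature layout and function names are placeholders rather than the paper's exact settings.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def learn_texton_dictionary(features_per_class, textons_per_class):
    """Cluster the training feature vectors of each class separately with
    K-means and stack the centres into one class-semantic texton dictionary."""
    centres = [KMeans(n_clusters=textons_per_class, n_init=10)
               .fit(feats).cluster_centers_ for feats in features_per_class]
    return np.vstack(centres)        # (num_classes * textons_per_class, dim)

def assign_textons(features, dictionary):
    """Hard-assign every pixel feature vector to its nearest texton; these
    assignments are what the occurrence frequencies are accumulated over."""
    return cdist(features, dictionary).argmin(axis=1)
```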
(2) Impact of the number of clusters on global accuracy. The second critical parameter of the proposed approach is the number of local clusters generated from every test image. It decides the manner and extent to which the local properties of objects in a test image are represented by the cluster-level class probabilities. If a relatively small number of clusters are used, pixels belonging to different objects are likely to be grouped into the same cluster, while with a large number of clusters, pixels from the same object are subdivided into multiple clusters. Neither case is beneficial to making correct segmentation decisions, and thus a proper number of clusters should be set. It can be observed from Fig. 8 that, for all the numbers of clusters tested, the ATCM outperforms all other approaches on both datasets, which implies that capturing local properties of objects in the test image consistently contributes important information towards correct segmentation. The performance of the ATCM is affected little by the number of clusters on the DTMR roadside image dataset and gains only marginal increases in global accuracy on the Croatia grass dataset. The results imply that even a small number of color and texture clusters can represent the most important local properties of the test image. With six local clusters, the ATCM achieves its highest global accuracy of 76.5% on the DTMR roadside image dataset. Interestingly, this number coincides with the total number of object categories for segmentation; although this has not been validated, it can arguably be inferred that each cluster might learn the most discriminative features of one object. Using six clusters, the accuracy is 94.6% on the Croatia grass dataset. For both datasets, using a larger number of clusters leads to consistent increases in the global accuracies of ANN_C and Texton_C, largely because using more clusters in the calculation of cluster-level probabilities reveals more detailed information about the local properties of objects in the test image. It is noted that the performance of ANN_P and Texton_P is not affected by the number of clusters, and they outperform ANN_C and Texton_C, which again confirms that supervised learning makes a larger contribution to segmentation decisions than unsupervised clustering in the proposed ATCM approach.

Fig. 8. Impact of number of clusters on global accuracy performance on the (a) DTMR roadside image dataset and (b) Croatia roadside grass dataset. System parameters: 70 color and texture textons and 16 hidden neurons.

(3) Impact of the number of hidden neurons in the ANN on global accuracy. The third parameter that we investigate is the number of neurons in the hidden layer of the class-specific ANN classifiers. It has a direct impact on the performance of the ANN classifiers and thus on the calculated pixel-level class probabilities. Note that the performance of Texton_P and Texton_C is not affected by the number of hidden neurons. From Fig. 9, we can observe that, among all approaches, the ATCM has the highest performance on both datasets for all numbers of hidden neurons evaluated, which indicates that combining pixel-level and cluster-level class probabilities helps achieve stably high performance. On the DTMR roadside image dataset, the highest global accuracy of 76.9% is obtained using the ATCM with 23 hidden neurons. Large fluctuations in the global accuracies are seen for both ANN_P and ANN_C, which perform object segmentation based on pixel-level and cluster-level class probabilities respectively. The results confirm the big impact of the number of hidden neurons on the segmentation results of the ANN classifier, while the type of decision voting strategy, i.e. superpixel- vs. cluster-level majority voting, has only limited impact. By contrast, there are only small changes in the global accuracies of all approaches on the Croatia grass dataset, probably because the ANN classifiers are insensitive to the number of hidden neurons when only two classes (grass and non-grass) are involved. For both datasets, ANN_P has a higher accuracy than ANN_C regardless of the number of hidden neurons used, indicating that segmentation with a superpixel-level voting strategy gives more robust and accurate results than with a cluster-level voting strategy.

Fig. 9. Impact of number of hidden neurons in ANN on global accuracy performance on the (a) DTMR roadside image dataset and (b) Croatia roadside grass dataset. Parameter settings: 6 clusters and 70 textons.
From the above results, we can see that the proposed approach exhibits different segmentation performance on the DTMR and Croatia datasets. Several factors may lead to these differences: (1) there are six object categories in the DTMR dataset as opposed to only two in the Croatia dataset; (2) there are large variations in the environmental conditions of the DTMR dataset as opposed to relatively less varied conditions in the Croatia dataset, so the DTMR dataset is expected to have more variation in the extracted features, which directly impacts the segmentation accuracy; and (3) the DTMR dataset has a separate training set that includes carefully selected cropped images for each object, with a roughly equal number of cropped images per object to ensure balanced training data, whereas the Croatia dataset has no cropped images and no guarantee of balanced training data.

4.4. Class accuracy and computational performance

Table 3 displays the confusion matrices of all evaluated classes and their class accuracies obtained using the ATCM. Among all objects in the DTMR roadside image dataset, sky is the easiest object to segment, with an accuracy of 96.3%, and the second easiest class is tree, with an accuracy of 80.4%. In contrast, the most difficult object is soil, which has only 46.6% accuracy. It is worth noting that as many as 43.8% of soil pixels are incorrectly classified as brown grass, probably because of the overlap in yellow color between brown grass and soil pixels. Similarly, green and brown grass pixels also contribute significantly to the misclassification error, owing to the similarity of texture between them and the difficulty of generating accurate pixel-wise ground truths to distinguish them; in our ground truth annotation, it is often difficult to assign some grass pixels to the brown or green grass category by visual observation. The results indicate that, to build automatic systems capable of segmenting vegetation from real-world scenes, it is of great significance to investigate more effective feature descriptors for discriminating objects with a similar color or texture structure. Brown grass, green grass and road have similar accuracies of around 75%. On the Croatia grass dataset, grass and non-grass have similar classification accuracies of around 95%, and both classes should be handled equivalently in building systems to discriminate them.

Table 4 compares the class, average and global accuracies of the ATCM with those of the four benchmark approaches. The ATCM is the best performer, with the highest global and average accuracies on both datasets. On the DTMR roadside image dataset, the ATCM achieves the highest class accuracy for the road class and ties with ANN_P for the highest class accuracy for the brown grass class. For the other object classes, the highest accuracies are obtained by one of the four benchmark approaches, which perform slightly better than the proposed approach. On the Croatia grass dataset, the ATCM achieves the highest and the second highest class accuracies for grass and non-grass respectively. The results again confirm the advantage of the ATCM in achieving accurate segmentation results for individual objects. It is also interesting to observe that some approaches seem more suitable for classifying specific types of objects. For instance, Texton_C performs poorly for tree but well for sky, while ANN_P performs well for tree but poorly for sky. This implies that a proper combination of multiple classifiers might complement each other towards correct classification of some difficult objects, leading to better overall performance.
Table 3
Class accuracies and confusion matrices produced by the ATCM.

(a) DTMR roadside image dataset (global accuracy = 76.9%)

              Brown Grass   Green Grass   Road    Soil    Tree    Sky
Brown Grass   74.7          15.7          2.1     2.7     4.8     0.0
Green Grass   10.8          76.0          1.1     0.4     11.7    0.0
Road          17.4          1.5           75.9    4.8     0.1     0.3
Soil          43.8          6.7           1.1     46.6    1.8     0.0
Tree          6.5           3.1           6.1     0.2     80.4    3.7
Sky           0.3           0.0           1.7     0.7     1.0     96.3

(b) Croatia roadside grass dataset (global accuracy = 94.6%)

            Grass   Non-grass
Grass       94.4    5.6
Non-grass   5.0     95.0

Table 4
Performance comparisons of the ATCM to four benchmark approaches. The best results are highlighted in bold.

(a) DTMR roadside image dataset

           Brown Grass   Green Grass   Road   Soil   Tree   Sky    Average   Global
ATCM       74.7          76.0          75.9   46.6   80.4   96.3   75.0      76.9
ANN_C      72.3          70.9          34.7   33.4   74.0   90.2   62.6      70.9
Texton_C   71.8          77.2          40.5   38.6   63.1   97.3   64.8      70.6
ANN_P      74.7          71.7          75.0   48.6   82.2   83.3   73.4      75.5
Texton_P   72.9          75.6          74.2   46.6   70.0   97.6   74.4      74.9

(b) Croatia roadside grass dataset

           Grass   Non-grass   Average   Global
ATCM       94.4    95.0        94.7      94.7
ANN_C      93.9    93.6        93.8      93.8
Texton_C   94.3    92.4        93.4      93.6
ANN_P      93.7    95.6        94.7      94.5
Texton_P   94.1    94.5        94.3      94.3

Fig. 10. Comparisons of the computational time of the proposed ATCM approach with four benchmark approaches on the DTMR and Croatia image datasets. The time is the average seconds required to get the segmentation results in a test image.
Fig. 10 shows the computational time of the proposed ATCM approach compared with the four benchmark approaches on the DTMR and Croatia datasets. The time is the average number of seconds required to perform object segmentation in a test image. All approaches were tested on the same computer with a 64-bit Windows 10 operating system, a Matlab 2018a platform, a 2.50 GHz Intel Core i5 processor and 8 GB of memory. It can be seen that, for both datasets, the proposed approach has a higher computational time than the four benchmark approaches, which is within our expectation because the proposed approach includes both ANN-based and cluster-level prediction. Compared with the ANN_P approach, which has the second best overall performance, the ATCM requires an additional time of less than 0.3 s. However, it should be noted that all approaches were developed in Matlab, which runs relatively slowly for image processing tasks. If the approaches were implemented in more time-efficient languages such as C or C++, the differences in computational time would be expected to shrink significantly.

4.5. Performance comparisons with existing approaches

The ATCM is based on a combination of pixel-level class probabilities predicted by ANN classifiers and cluster-level class probabilities calculated from texton occurrence frequencies. Can combining other types of class probabilities lead to better performance? Table 5 compares the global accuracies of four approaches whose pixel- and cluster-level class probabilities are obtained using either the ANN or texton occurrence frequencies on the DTMR roadside image dataset. As can be seen, fusing class probabilities obtained using two different methods, i.e. texton and ANN, leads to higher global accuracy than fusing those obtained using the same method, i.e. texton only or ANN only. This is primarily because pixel- and cluster-level probabilities from the same method often overlap substantially in their segmentation decisions, whereas those from different methods retain a diversity of decisions, so their combination leads to more robust performance.

Table 5
Performance comparisons of approaches fusing different types of class probabilities on the DTMR roadside image dataset.

Approach                                         Global Acc. (%)
pixel-level ANN + cluster-level texton (ATCM)    76.9
pixel-level texton + cluster-level ANN           76.2
pixel-level texton + cluster-level texton        75.1
pixel-level ANN + cluster-level ANN              75.9
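The four variants in Table 5 differ only in which method supplies the pixel-level and the cluster-level probability maps before the Eq. (24) fusion. A small driver such as the following sketch, which reuses the superpixel_voting and segmentation_scores helpers sketched in Sections 3.5 and 4.2 and whose argument names are hypothetical, is enough to reproduce this kind of comparison on one's own data.

```python
def compare_fusion_variants(pixel_srcs, cluster_srcs, superpixel_ids,
                            truth, num_classes, b=0.8):
    """pixel_srcs and cluster_srcs map a method name ('ANN' or 'texton') to an
    (num_pixels, M) probability array; every pairing is fused with Eq. (24)
    and scored by global accuracy, mirroring the rows of Table 5."""
    results = {}
    for p_name, p_probs in pixel_srcs.items():
        for c_name, c_probs in cluster_srcs.items():
            labels = superpixel_voting(p_probs, c_probs, superpixel_ids, b=b)
            global_acc, _, _ = segmentation_scores(labels, truth, num_classes)
            results[f"pixel-level {p_name} + cluster-level {c_name}"] = 100 * global_acc
    return results
```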
Table 6 compares the segmentation results of the ATCM with those of previous approaches evaluated on the DTMR roadside image dataset and the Croatia roadside grass dataset. From the table, we can observe that the ATCM performs better than the two texton-based approaches in Zhang et al. (2015a) and Zhang and Verma (2015) on the DTMR dataset, with the highest global accuracy being 76.9%. On the Croatia grass dataset, the ATCM also produces higher global accuracy than the three approaches in Harbas and Subasic (2014a), which are based on CWT, Visible Vegetation Index (VVI), and Green-Red Vegetation Index (GRVI) features respectively. It obtains a global accuracy similar to that of the approach in Harbas and Subasic (2014b), which uses RGB and entropy features. However, it should be noted that the results of the existing approaches in Harbas and Subasic (2014b) and Harbas and Subasic (2014a) are based on a much higher image resolution than that used in the proposed ATCM (1920×1080 vs. 320×240 pixels). The results indicate superior performance of the ATCM even when only low-resolution image data are available, which is one main advantage for achieving real-time processing and handling low-quality data. The ATCM also outperforms the approaches in Harbas and Subasic (2014b) that use color features (i.e., RGB, HSV or Lab) with an SVM classifier, indicating the necessity of incorporating texture features to achieve better results.

Table 6
Global accuracy comparisons with existing approaches.

Dataset   Approach                                                       Resolution   Global Acc. (%)
DTMR      Proposed ATCM                                                  320×240      76.9
DTMR      textons in superpixel neighborhoods (Zhang and Verma, 2015)    320×240      75.5
DTMR      textons in superpixels (Zhang et al., 2015a)                   320×240      74.5
Croatia   Proposed ATCM                                                  320×240      94.7
Croatia   RGB+entropy+SVM (Harbas and Subasic, 2014b)                    1920×1080    95.0
Croatia   BlueSUAB+2D CWT (Harbas and Subasic, 2014a)                    1920×1080    93.3
Croatia   VVI (Harbas and Subasic, 2014a)                                1920×1080    58.3
Croatia   GRVI (Harbas and Subasic, 2014a)                               1920×1080    67.6
Croatia   RGB+SVM (Harbas and Subasic, 2014b)                            1920×1080    92.7
Croatia   HSV+SVM (Harbas and Subasic, 2014b)                            1920×1080    87.3
Croatia   Lab+SVM (Harbas and Subasic, 2014b)                            1920×1080    92.8

Fig. 11 visually compares the segmentation results of the ATCM with two previous approaches (Zhang et al., 2015a; Zhang and Verma, 2015) on sample images from the DTMR roadside image dataset. The ATCM shows compelling advantages in correcting misclassification errors towards more robust segmentation. In the results of Zhang et al. (2015a) and Zhang and Verma (2015), a significant proportion of tree pixels (indicated by a blue color) are misclassified as road (indicated by a red color) due to a high level of color and texture similarity between them. These errors are successfully corrected by the ATCM owing to the consideration of local properties of the test image in the texton-based clustering process, which groups tree and road pixels of the test image into separate clusters based on their visual characteristics. For the same reason, the ATCM also obtains better results in segmenting soil pixels than the two benchmark approaches.

Fig. 11. Segmentation results of images from the DTMR roadside image dataset. (first row) raw roadside image; (second row) ground truths; (third row) results obtained using textons in superpixels (Zhang et al., 2015a); (fourth row) results obtained using textons in superpixel neighborhoods (Zhang and Verma, 2015); (last row) results of the ATCM.

Fig. 12 shows the segmentation results of the ATCM compared to two benchmark approaches, which are based on pixel- and cluster-level class probabilities respectively, on the Croatia roadside grass dataset. Again, the ATCM demonstrates advantages over the benchmark approaches in correcting misclassified regions that have a color or texture similar to grass, such as the trees in the fourth image, the light pole in the fifth image, and the bench in the first image, as well as in overcoming environmental variations such as the shadows of light poles and trees in the second and last images respectively.

Fig. 12. Segmentation results of images from the Croatia roadside grass dataset. (first row) raw roadside image; (second row) ground truths; (third row) results obtained using pixel-level class probabilities predicted by ANN classifiers; (fourth row) results obtained using cluster-level class probabilities calculated based on texton occurrence frequencies; (last row) results of the ATCM.

We also evaluate the performance of the proposed ATCM approach on a widely used real-world benchmark, the Stanford background dataset (Gould et al., 2009). The Stanford background dataset includes 715 outdoor scene images of eight classes: sky, tree, road, grass, water, building, mountain and foreground object. Most of these object classes are also frequently present in roadside scenes. The image pixels are manually annotated as one of the eight classes or as unknown. Following Gould et al. (2009), five-fold cross-validation is conducted to obtain the global segmentation accuracy: in each fold, 572 images are randomly selected for training and the remaining 143 images are used for testing. Table 7 compares the global accuracy of the ATCM approach with the reported global accuracies of existing approaches. It can be seen that the proposed ATCM produces a very competitive performance compared to existing approaches. Specifically, it has a similar global accuracy (i.e., more than 81%) to the approaches in Lempitsky et al. (2011), Farabet et al. (2013), Sharma et al. (2014), Bing et al. (2015) and Yuan et al. (2017), and a 0.9% lower global accuracy than the approach in Sharma et al. (2015). The results indicate that the proposed ATCM approach has a stable and competitive performance in terms of global accuracy on real-world datasets. Fig. 13 shows the segmentation results on sample images.

Table 7
Performance comparisons with existing approaches on the Stanford background dataset.

Reference                    Global Acc. (%)
(Gould et al., 2009)         76.4
(Munoz et al., 2010)         76.9
(Tighe and Lazebnik, 2010)   77.5
(Kumar and Koller, 2010)     79.4
(Socher et al., 2011)        78.1
(Lempitsky et al., 2011)     81.9
(Farabet et al., 2013)       81.4
(Sharma et al., 2014)        81.8
(Bing et al., 2015)          81.2
(Sharma et al., 2015)        82.3
(Yuan et al., 2017)          81.7
Proposed ATCM                81.4

Fig. 13. Segmentation results of images from the Stanford background dataset (best viewed in color).

5. Result discussions

The main lessons that can be learnt from the experimental results include:

(a) Considering both the generic characteristics of objects learnt from the training data and the local properties of objects in the test data is an effective way of designing robust vegetation segmentation systems. As the scene content may vary significantly between test images, considering local properties of each individual test image plays a crucial role in enhancing the robustness of the system by automatically adapting to the local contextual information and environmental conditions of each test image.

(b) Fusion of decisions from multiple classifiers reduces variations in the accuracy of the proposed ATCM and helps it achieve higher and more stable performance using only a relatively small number of features. In addition, each class-specific classifier seems to be more suited to a specific object, and thus a combination of multiple classifiers, such as an ensemble of ANN classifiers, would be a better choice for obtaining good overall performance across all objects. It is noted that supervised learning exhibits better performance than unsupervised clustering in the proposed ATCM on both the DTMR and Croatia roadside image datasets.

(c) In a decision-level fusion of pixel- and cluster-level class probabilities towards more robust classification, a superpixel-based voting strategy appears to produce more accurate segmentation results than a cluster-level voting strategy. This is probably because pixels within a spatial neighborhood (e.g. a superpixel) naturally exhibit higher likelihoods of belonging to the same object category than pixels in a spatially unconstrained cluster, which indicates the necessity of enforcing a spatial constraint on the consistency of object labels of neighboring pixels to improve segmentation results. Our experimental results also show that it is advisable to fuse class probabilities predicted by different machine learning approaches rather than by the same approach; this helps retain a diversity of classification decisions and leads to higher accuracy.

(d) The overlaps in color and texture features between roadside objects, such as between soil and brown grass and between green and brown grass, pose a big challenge for robust object segmentation in both roadside datasets. Therefore, it is still necessary to design more effective and robust feature descriptors or to exploit automatic feature extraction using techniques such as deep learning to handle the confusion between visually similar objects. Recent advances in deep learning networks such as convolutional neural networks show promising object segmentation results in various computer vision tasks (Zheng et al., 2016).

(e) Comparison results on the Croatia roadside grass dataset show that considering both texture and color features produces higher global segmentation accuracy than using only color features, including RGB, HSV, and Lab. The result confirms the significance of incorporating texture features in object segmentation from natural roadside images.

6. Conclusions and future work

This paper presents an Adaptive Texton Clustering Model (ATCM) for segmenting vegetation from real-world roadside image scenes. The ATCM performs pixel-level supervised classification and texton based unsupervised clustering independently and further combines their prediction results in a superpixel-based majority voting strategy. A main advantage of the ATCM is that it considers both the generic characteristics of vegetation learnt from the training data and the local properties of vegetation in every test image, and it further enforces spatial consistency of class labels within homogeneous superpixels to reduce misclassification and eventually achieve robust vegetation segmentation in natural scenes. Evaluations on three challenging real-world datasets confirm the high performance of the ATCM, including global accuracies of 76.9% and 94.7% on the DTMR and Croatia roadside image datasets respectively, even using low-resolution images, and a competitive global accuracy of 81.4% on the Stanford background dataset.

There are still several possible extensions to the ATCM. First, the same set of color and texture features is adopted for generating class-semantic textons, dividing a test image into local clusters, and training the ANN classifiers. It is worth exploring the use of different types of features in these processing steps to reduce possibly overlapping and redundant features. Second, only RGB and Lab color features extracted at a pixel level are used. In real-life scenarios, natural objects are often characterized by statistical or structural features over a larger spatial range, and thus one possible future work is to extract and evaluate region based statistical features. Third, the classification decision in the ATCM is based on majority voting over pixels within a superpixel. In natural scenes, it is highly likely that a superpixel and its neighbors also come from the same object category, and thus another future work is to enforce the consistency of object categories in a larger neighborhood of every superpixel. Last but not least, some hyper-parameters of the ATCM approach, such as the number of hidden neurons in the ANN and the number of textons, are determined primarily via experimental comparisons. It is worth investigating more advanced technologies, such as automatic model selection methods (Luo, 2016), to automatically determine the optimal values of these parameters.

Acknowledgments

This research was supported under the Australian Research Council's Linkage Projects funding scheme (project number LP140100939). We would like to thank the Department of Transport and Main Roads (DTMR), Queensland, Australia for allowing us to access their roadside video data.

References
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Su, x., sstrunk, S., 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2274–2282. Andújar, D., Escolà, A., Rosell-Polo, J.R., Sanz, R., Rueda-Ayala, V., FernándezQuintanilla, C., Ribeiro, A., Dorado, J., 2016. A LiDAR-based system to assess poplar biomass. Gesunde Pflanzen 68, 155–162. Balali, V., Golparvar-Fard, M., 2015. Segmentation and recognition of roadway assets from car-mounted camera video streams using a scalable non-parametric image parsing method. Autom. Constr. 49 (Part A), 27–39. Bing, S., Gang, W., Zhen, Z., Bing, W., Lifan, Z., 2015. Integrating parametric and nonparametric models for scene labeling. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. pp. 4249–4258. Blas, M.R., Agrawal, M., Sundaresan, A., Konolige, K., 2008. Fast color/texture segmentation for outdoor robots. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on, pp. 4078–4085. Bosch, A., Muñoz, X., Freixenet, J., 2007. Segmentation and description of natural outdoor scenes. Image Vis. Comput. 25, 727–740. Bradley, D.M., Unnikrishnan, R., Bagnell, J., 2007. Vegetation detection for driving in complex environments. In: Robotics and Automation, IEEE International Conference on, pp. 503–508. Campbell, N.W., Thomas, B.T., Troscianko, T., 1997. Automatic segmentation and classification of outdoor images using neural networks. Int. J. Neural Syst. 08, 137–144. Chang, C., Koschan, A., Chung-Hao, C., Page, D.L., Abidi, M.A., 2012. Outdoor scene image segmentation based on background recognition and perceptual organization. IEEE Trans. Image Process. 21, 1007–1019. Chang, Y.K., Zaman, Q.U., Rehman, T.U., Farooque, A.A., Esau, T., Jameel, M.W., 2017. A real-time ultrasonic system to measure wild blueberry plant height during harvesting. Biosyst. Eng. 157, 35–44. Chowdhury, S., Verma, B., Stockwell, D., 2015. A novel texture feature based multiple classifier technique for roadside vegetation classification. Expert Syst. Appl. 42, 5047– 5055. Farabet, C., Couprie, C., Najman, L., LeCun, Y., 2013. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1915–1929. Felzenszwalb, P., Huttenlocher, D., 2004. Efficient graph-based image segmentation. Int. J. Comput. Vis. 59, 167–181. Fulkerson, B., Vedaldi, A., Soatto, S., 2009. Class segmentation and object localization with superpixel neighborhoods. In: Computer Vision (ICCV), IEEE 12th International Conference on, pp. 670–677.
Engineering Applications of Artificial Intelligence 77 (2019) 159–176 Schaefer, M., Lamb, D., 2016. A Combination of Plant NDVI and LiDAR measurements improve the estimation of pasture biomass in tall fescue (Festuca arundinacea var. Fletcher). Remote Sens. 8, 109. Schepelmann, A., Hudson, R.E., Merat, F.L., Quinn, R.D., 2010. Visual segmentation of lawn grass for a mobile robotic lawnmower. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on, pp. 734–739. Sharma, A., Tuzel, O., Jacobs, D.W., 2015. Deep hierarchical parsing for semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 530–538. Sharma, A., Tuzel, O., Liu, M.-Y., 2014. Recursive context propagation network for semantic scene labeling. Adv. Neural Inf. Process. Syst. 2447–2455. Shotton, J., Johnson, M., Cipolla, R., 2008. Semantic texton forests for image categorization and segmentation. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 1–8. Singhal, A., Jiebo, L., Weiyu, Z., 2003. Probabilistic spatial context models for scene content understanding. In: Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Conference on, pp. 235–241. Socher, R., Lin, C.C., Manning, C., Ng, A.Y., 2011. Parsing natural scenes and natural language with recursive neural networks. In: Proceedings of the 28th international conference on machine learning (ICML), pp. 129–136. Tighe, J., Lazebnik, S., 2010. SuperParsing: scalable nonparametric image parsing with superpixels. In: Daniilidis, K., Maragos, P., Paragios, N. (Eds.), Computer Vision (ECCV). Springer, Berlin Heidelberg, pp. 352–365. Tilly, N., Hoffmeister, D., Cao, Q., Lenz-Wiedemann, V., Miao, Y., Bareth, G., 2015. Transferability of models for estimating paddy rice biomass from spatial plant height data. Agriculture 5, 538–560. Verma, B., Zhang, L., Stockwell, D., 2017. Roadside Video Data Analysis: Deep Learning. Springer. Winn, J., Criminisi, A., Minka, T., 2005. Object categorization by learned universal visual dictionary. In: Computer Vision (ICCV), Tenth IEEE International Conference on, pp. 1800–1807. Wurm, K.M., Kretzschmar, H., Kümmerle, R., Stachniss, C., Burgard, W., 2014. Identifying vegetation from laser data in structured outdoor environments. Robot. Auton. Syst. 62, 675–684. Yuan, Y., Fu, M., Lu, X., 2015. Substance dependence constrained sparse nmf for hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 53, 2975–2986. Yuan, Y., Jiang, Z., Wang, Q., 2017. HDPA: Hierarchical deep probability analysis for scene parsing. In: Multimedia and Expo (ICME), IEEE International Conference on, pp. 313–318. Zafarifar, B., de With, P.H.N., 2008. Grass field detection for TV picture quality enhancement. In: Consumer Electronics (ICCE), Digest of Technical Papers. International Conference on, pp. 1–2. Zhang, L., Grift, T.E., 2012. A lidar-based crop height measurement system for miscanthus giganteus. Comput. Electron. Agric. 85, 70–76. Zhang, L., Verma, B., 2015. Class-semantic textons with superpixel neighborhoods for natural roadside vegetation classification. In: Digital Image Computing: Techniques and Applications (DICTA), International Conference on, pp. 1–8. Zhang, L., Verma, B., 2017. Superpixel-based class-semantic texton occurrences for natural roadside vegetation segmentation. Mach. Vis. Appl. 28, 293–311. Zhang, L., Verma, B., Stockwell, D., 2015a. Class-Semantic color-texture textons for vegetation classification. In: Arik, S., Huang, T., Lai, W.K., Liu, Q. 
(Eds.), Neural Information Processing. Springer International Publishing, pp. 354–362. Zhang, L., Verma, B., Stockwell, D., 2015b. Roadside vegetation classification using color intensity and moments. In: Natural Computation (ICNC), 11th International Conference on, pp. 1250–1255. Zhang, L., Verma, B., Stockwell, D., 2016a. Spatial contextual superpixel model for natural roadside vegetation classification. Pattern Recognit. 60, 444–457. Zhang, L., Verma, B., Stockwell, D., Chowdhury, S., 2016b. Aggregating pixel-level prediction and cluster-level texton occurrence within superpixel voting for roadside vegetation classification. In: Neural Networks (IJCNN), International Joint Conference on, 3249–3255. Zheng, L., Zhao, Y., Wang, S., Wang, J., Tian, Q., 2016. Good practice in cnn feature transfer. arXiv preprint arXiv:1604.00133.
Galidaki, G., Zianis, D., Gitas, I., Radoglou, K., Karathanassi, V., Tsakiri-Strati, M., Woodhouse, I., Mallinis, G., 2017. Vegetation biomass estimation with remote sensing: focus on forest and other wooded land over the Mediterranean ecosystem. Int. J. Remote Sens. 38, 1940–1966. Gould, S., Fulton, R., Koller, D., 2009. Decomposing a scene into geometric and semantically consistent regions. In: Computer Vision (ICCV), IEEE 12th International Conference on, pp. 1–8. Harbas, I., Subasic, M., 2014a. CWT-based detection of roadside vegetation aided by motion estimation. In: Visual Information Processing (EUVIP), 5th European Workshop on, pp. 1–6. Harbas, I., Subasic, M., 2014b. Detection of roadside vegetation using features from the visible spectrum. In: Information and Communication Technology, Electronics and Microelectronics (MIPRO), 37th International Convention on, pp. 1204–1209.. Harbas, I., Subasic, M., 2014c. Motion estimation aided detection of roadside vegetation. In: Image and Signal Processing (CISP), 7th International Congress on, pp. 420–425. Junqing, C., Pappas, T.N., Mojsilovic, A., Rogowitz, B., 2005. Adaptive perceptual colortexture image segmentation. IEEE Trans. Image Process. 14, 1524–1536. Kang, Y., Yamaguchi, K., Naito, T., Ninomiya, Y., 2011. Multiband image segmentation and object recognition for understanding road scenes. IEEE Trans. Intell. Transp. Syst. 12, 1423–1433. Kumar, M.P., Koller, D., 2010. Efficiently selecting regions for scene understanding. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 3217– 3224. Kumar, S., Loui, A.C., Hebert, M., 2003. An observation-constrained generative approach for probabilistic classification of image regions. Image Vis. Comput. 21, 87–97. Lempitsky, V., Vedaldi, A., Zisserman, A., 2011. Pylon model for semantic segmentation. Adv. Neural Inf. Process. Syst. 1485–1493. Li, L., Li, D., Zhu, H., Li, Y., 2016. A dual growing method for the automatic extraction of individual trees from mobile laser scanning data. ISPRS J. Photogramm. Remote Sens. 120, 37–52. Liu, d.-x., Wu, T., Dai, B., 2007. Fusing ladar and color image for detection grass off-road scenario. In: Vehicular Electronics and Safety (ICVES), IEEE International Conference on, pp. 1–4. Lu, D., Chen, Q., Wang, G., Liu, L., Li, G., Moran, E., 2016. A survey of remote sensingbased aboveground biomass estimation methods in forest ecosystems. Int. J. Digit. Earth 9, 63–105. Lu, X., Wu, H., Yuan, Y., 2014. Double constrained nmf for hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 52, 2746–2758. Luo, G., 2016. A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw. Model. Anal. Health Inf. Bioinf. 5 (18). Mackeown, W., 1994. A Labelled Image Database and its Application to Outdoor Scene Analysis (Ph.D. thesis), University of Bristol. Moeckel, T., Safari, H., Reddersen, B., Fricke, T., Wachendorf, M., 2017. Fusion of ultrasonic and spectral sensor data for improving the estimation of biomass in grasslands with heterogeneous sward structure. Remote Sens. 9 (98). Munoz, D., Bagnell, J.A., Hebert, M., 2010. Stacked hierarchical labeling. In: Daniilidis, K., Maragos, P., Paragios, N. (Eds.), Computer Vision (ECCV). Springer, Berlin Heidelberg, pp. 57–70. MyungJin, C., Torralba, A., Willsky, A.S., 2012. A tree-based context model for object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 34, 240–252. Nguyen, D.V., Kuhnert, L., Jiang, T., Thamke, S., Kuhnert, K.D., 2011. 
Vegetation detection for outdoor automobile guidance. In: Industrial Technology (ICIT), IEEE International Conference on, pp. 358–364. Nguyen, D.V., Kuhnert, L., Kuhnert, K.D., 2012a. Spreading algorithm for efficient vegetation detection in cluttered outdoor environments. Robot. Auton. Syst. 60, 1498– 1507. Nguyen, D.V., Kuhnert, L., Kuhnert, K.D., 2012b. Structure overview of vegetation detection. A novel approach for efficient vegetation detection using an active lighting system. Robot. Auton. Syst. 60, 498–508. Nguyen, D.V., Kuhnert, L., Thamke, S., Schlemper, J., Kuhnert, K.D., 2012c. A novel approach for a double-check of passable vegetation detection in autonomous ground vehicles. In: Intelligent Transportation Systems (ITSC), 15th International IEEE Conference on, pp. 230–236. Ponti, M.P., 2013. Segmentation of low-cost remote sensing images combining vegetation indices and mean shift. IEEE Geosci. Remote Sens. Lett. 10, 67–70. Santi, E., Paloscia, S., Pettinato, S., Fontanelli, G., Mura, M., Zolli, C., Maselli, F., Chiesi, M., Bottai, L., Chirici, G., 2017. The potential of multifrequency SAR images for estimating forest biomass in Mediterranean areas. Remote Sens. Environ. 200, 63–73.