Information Fusion 41 (2018) 16–24


Hierarchical ELM ensembles for visual descriptor fusion

Stevica Cvetković a,∗, Miloš B. Stojanović b, Saša V. Nikolić a

a Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 14, Niš 18000, Serbia
b College of Applied Technical Sciences Niš, Aleksandra Medvedeva 20, Niš 18000, Serbia

∗ Corresponding author. E-mail address: [email protected] (S. Cvetković).

http://dx.doi.org/10.1016/j.inffus.2017.07.003

Article history: Received 26 December 2016; Revised 9 June 2017; Accepted 27 July 2017; Available online 28 July 2017

Keywords: Feature fusion; Extreme Learning Machine; Hierarchical classifiers; Scene classification

Abstract

Extreme Learning Machines (ELM) have been successfully applied to a variety of classification problems using a single descriptor type. However, a single descriptor may be insufficient for the visual classification task, due to high intra-class variability coupled with low inter-class distance. Although several studies have investigated methods for combining multiple descriptors with ELM, they predominantly apply a simple concatenation of descriptors before classification. This type of descriptor fusion may impose problems of descriptor compatibility, high dimensionality and restricted accuracy. In this paper, we propose a hierarchical descriptor fusion strategy at the decision level ("late fusion"), which relies on ELM ensembles (ELM-E). The proposed method, denoted H-ELM-E, effectively combines multiple complementary descriptors in a two-level ELM-E based architecture, which ensures that more informative descriptors gain more impact on the final decision. In the first level, a separate ELM-E classifier is trained for every image descriptor. In the second level, the output scores from the previous level are aggregated into a mid-level representation, which is fed to an additional ELM-E classifier. An exhaustive experimental evaluation confirmed that the proposed hierarchical ELM-E based strategy is superior to single-descriptor methods, as well as to "early fusion" of multiple descriptors, for the visual classification task. Additionally, it was shown that a significant accuracy improvement is achieved by integrating ensembles of ELM as the basic classifier, instead of using a single ELM. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, there have been great advances in natural scene image processing. Research has focused both on low-level tasks, such as denoising or segmentation, and on high-level ones, such as detection or classification. A variety of algorithms have been developed for classification at the pixel level; however, the problem becomes more complex at the level of complete scene classification. The goal of scene classification is to label an image according to a set of predefined semantic categories (e.g. forest, river, mountain, desert, etc.). It is a challenging problem because of the large variability within a given class in terms of content, color, scale and orientation. High intra-class variability can be coupled with low inter-class distance, a problem that grows as finer classification is required.

Research on natural scene classification has focused both on the use of suitable image descriptors and on appropriate classification algorithms. A variety of image texture descriptors have been proposed in the literature [13,37,50] and applied to scene classification. To make these descriptors more robust, it was found necessary to include additional visual cues, such as color information, which has been employed to improve the performance of scene classification algorithms due to the complementary characteristics among the color channels [43]. Although there is an increasing amount of work on combining texture and color descriptors [4,7,17,27,45], effective fusion of descriptors by assessing their complementarity is still an open research problem in computer vision. This motivated us to explore complementary visual information in order to boost scene classification performance.

To make image descriptors more robust, we found it necessary to simultaneously include multiple visual cues (i.e. texture, color, etc.) using an appropriate fusion strategy. The fusion process can occur at the descriptor level or at the decision level [2,44]. While descriptor-level fusion (i.e. "early fusion") integrates heterogeneous descriptors into a single vector, decision-level fusion ("late fusion") operates on the output classification scores of each individual descriptor and combines them into a final decision. Despite its simplicity and computational efficiency, the early fusion approach may impose problems of descriptor compatibility, high dimensionality and restricted accuracy. The basic approach to late fusion is to use a fixed weight for each classifier score and then compute a weighted sum of the scores as the final result. This assumes that all classifiers share the same weight and cannot account for differences in the classifiers' individual prediction capability.


Therefore, in this work, we focus on late fusion of descriptors, where an additional classifier is trained to estimate the specific fusion weights for each separate descriptor. The proposed method is investigated in the context of the scene classification task.

As the basic classifier, we consider single hidden layer feedforward neural networks (SLFN), an alternative to the commonly used SVM [12]. Concretely, we investigate a recently introduced SLFN training algorithm termed the Extreme Learning Machine (ELM) [20,24]. The choice of the ELM classifier is due to its extremely efficient training procedure and highly accurate classification performance. The main drawback of traditional artificial neural networks and SVM is their training speed, which has been a major issue for practical applications, especially when real-time output of the system is needed. The ELM drastically increases the training speed of SLFNs by randomly generating the input weights and biases of the hidden layer nodes, instead of iteratively adjusting these parameters with the commonly used gradient-based methods. The output weights of the hidden layer are then analytically computed by a least squares method. Besides minimizing the training error, the ELM finds the smallest norm of the output weights and hence tends to give better generalization performance than gradient-based learning algorithms, such as backpropagation. Moreover, the ELM can naturally handle the multi-class classification problem with an architecture whose number of output nodes equals the number of pattern classes. This is an advantage compared to the widely used SVM method, which applies a one-versus-all or one-versus-one strategy to handle non-binary cases [40]. It is therefore highly beneficial to study possibilities for ELM integration into a heterogeneous descriptor fusion scheme, as presented in this work.

The ELM has already been applied to a variety of classification-related problems including texture classification [26], protein sequence classification [11], remote sensing image classification [15,33] and landmark recognition [8,10]. Compared to existing machine learning techniques, the ELM is conceptually simpler and computationally more efficient while demonstrating high generalization capabilities. However, the random assignment of parameters introduces suboptimal input weights and biases into the hidden layer, which may result in unstable and non-optimal output. A natural way to overcome this drawback is to use an ensemble of ELMs, following the established principles of randomized learners such as Random Forest [5]. Several algorithms for the formation of ELM ensembles were recently proposed [9,34,42], including our Average Score Aggregation [14]. The main advantage of ensembles comes from the fact that the combined outputs of several diverse learners can exceed the generalization capability of any single classifier used in the ensemble [18]. To further improve diversity, several learner-independent techniques, such as resampling, label switching and feature space partitioning, can be applied [3].

Inspired by the two previous trends of descriptor fusion and ELM ensembles, we propose to couple them in a way that allows ELMs to directly select, from a set of descriptor candidates, those descriptors that best discriminate the target classes. Our approach to descriptor fusion is hierarchical: we propose a two-level ELM-based architecture which ensures that more informative descriptors gain more significance in the final decision.
In the first level, a separate ELM classifier is trained for every image descriptor. Then, in the second level, the output scores returned by the first-level classifiers are aggregated to obtain a mid-level representation. The mid-level descriptor is then used as the input of the second-level ELM classifier, which produces the final classification result. In this way the second-level classifier can directly favor those descriptors that best discriminate the target classes. To further improve the accuracy of the method, we propose to integrate ensembles of ELM as the basic classifier, instead of a single ELM.


Ensembles of ELM have been shown to improve classification accuracy substantially without significant additional computation [9,14]. In this work, we successfully integrate ELM ensembles into the proposed hierarchical ELM architecture for the scene classification task.

We consider the main contribution of the paper to be the introduction of a novel descriptor fusion method that effectively tackles image intra-class diversity through a hierarchical ELM-based approach. Apart from the theoretical contribution, we performed an extensive evaluation over two public scene datasets, which showed that the proposed algorithm can reach highly accurate results without computationally complex operations. A comparative evaluation demonstrates the increased classification accuracy of the proposed H-ELM-E method compared to the accuracy obtained with separate descriptors, as well as to early fusion of descriptors (i.e. descriptor concatenation). In addition, the experiments demonstrate the high computational efficiency of the complete scene classification pipeline.

The remainder of the paper is organized as follows. Section 2 gives a brief overview of ELM and ensembles of ELMs for multi-class classification, and then introduces the proposed method for hierarchical descriptor fusion which relies on ELM ensembles. Section 3 describes the extraction of the visual descriptors used in the proposed classification scheme. Experimental results and discussion are presented in Section 4, while Section 5 draws conclusions and proposes ideas for future work.

2. Hierarchical fusion of Extreme Learning Machines (ELM)

Fusion of classifiers aims to combine mutually complementary individual classifiers characterized by high diversity and accuracy [47]. It is intuitive that increasing diversity should lead to better accuracy of the combined classifier, although there is no formal proof of this dependency. Brown et al. [6] observed that diversity can be successfully ensured by independently generating individual classifiers based on random techniques. The advantage of using ELMs in the fusion is that their diversity comes naturally from the randomness in the hidden layer of neurons. An additional increase in the diversity of the proposed hierarchical method is provided by integrating an ensemble of ELMs as the basic classifier, instead of using a single ELM. We will first give a brief overview of ELM and ensembles of ELMs, and afterwards describe the proposed hierarchical ELM-based algorithm for heterogeneous descriptor fusion (H-ELM-E).

2.1. ELM for multiclass classification

Suppose that we have N training samples denoted as (xj, yj), j = 1, ..., N, where xj = [xj1, xj2, ..., xjn]T ∈ Rn represents the j-th training sample of dimension n, and yj = [yj1, yj2, ..., yjm]T ∈ Rm represents the j-th training label of dimension m, where m is the number of classes. In the context of visual feature fusion, xj can be regarded as an image descriptor, while yj is an m-dimensional binary vector of class labels, with value "1" at the position of the corresponding class and value "0" at all other positions. The output of an ELM with L hidden neurons and activation function h(x) is defined as

$$f(\mathbf{x}_j) = \sum_{i=1}^{L} \beta_i\, h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i), \quad j = 1, \ldots, N \qquad (1)$$

where h(·) is a nonlinear piecewise continuous activation function, βi ∈ Rm represents the weight vector connecting the ith hidden neuron and all the output neurons, wi ∈ Rn is the weight vector connecting the ith hidden neuron and all input neurons, and bi is the threshold of the ith hidden neuron.


Algorithm 1. ELM ensemble using Average Score Aggregation [14].

Given: the number of classes m, the sigmoid activation function h(x), the number of hidden neurons L, and the number of ELMs in an ensemble K (every ELM uses the same values of m, h(x), L and K).

Training
Input: a training set S = {(xj, yj) | xj ∈ Rn, yj ∈ Rm, j = 1, ..., N} with N instances of n-dimensional descriptors and m-dimensional output scores.
for k = 1 to K
  1. Generate random input weights wi and biases bi, i = 1, ..., L.
  2. Assign the input weights wi and biases bi to ELM(k).
  3. Calculate the hidden layer output matrix Htrain(k) using the complete S, according to (4).
  4. Compute β(k) = H†train(k) Y using the complete S.
end for

Testing
Input: a test set S′ = {xj | xj ∈ Rn, j = 1, ..., N′} with N′ instances of n-dimensional descriptors.
for k = 1 to K
  1. Calculate the hidden layer output matrix Htest(k) using Eq. (4) for the new instances from the test set S′.
  2. Obtain the output matrix Ytest(k) = Htest(k) β(k) of dimensionality N′ × m.
end for
Sum up all K output matrices: Ytest = Σk=1..K Ytest(k).
For every test instance (i.e. every row of Ytest), compute the class label as the index of the maximal value in the row.

Although the sigmoid activation function is the most commonly used in practical applications, other activation functions can also be applied (Gaussian, wavelet, hyperbolic tangent, etc.) [21]. According to the ELM universal approximation property, an ELM is able to solve any regression problem with a desired accuracy, if it has enough hidden neurons and enough training data to learn the parameters of all the hidden neurons [21,22]. In addition, ELMs can be easily adapted to classification problems [23] by predicting the class label as the index of the output node with the highest score. ELM theory also proves the classification capability of wide classes of networks with random hidden neurons: it was proven that if tuning the parameters of hidden neurons could make an ELM approximate any target continuous function, then an ELM with a random hidden layer mapping can separate arbitrary disjoint regions of any shape [23]. In the context of ELM theory, wi and bi can be randomly and independently assigned a priori, without considering the input data [24]. An SLFN defined by (1) can approximate the N training samples with zero error, which means $\sum_{j=1}^{N} \| f(\mathbf{x}_j) - \mathbf{y}_j \| = 0$, i.e. there exist βi, wi and bi such that

$$\sum_{i=1}^{L} \beta_i\, h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{y}_j, \quad j = 1, \ldots, N \qquad (2)$$

If a value of 1 is padded to xj to make it an (n + 1)-dimensional vector, the bias can be considered as an element of the weight vector. The equivalent compact matrix form of (2) for N input samples can be written as

$$H\beta = Y \qquad (3)$$

where H represents the hidden layer output matrix of the complete neural network, with the ith column of H representing the ith hidden neuron's output vector with respect to the inputs x1, x2, ..., xN:

$$H = \begin{bmatrix} h(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & h(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ h(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & h(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L} = \begin{bmatrix} h(\mathbf{x}_1) \\ \vdots \\ h(\mathbf{x}_N) \end{bmatrix} \qquad (4)$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad Y = \begin{bmatrix} \mathbf{y}_1^T \\ \vdots \\ \mathbf{y}_N^T \end{bmatrix}_{N \times m} \qquad (5)$$

In (4), $h(\mathbf{x}_j) = [h(\mathbf{w}_1 \cdot \mathbf{x}_j + b_1), \ldots, h(\mathbf{w}_L \cdot \mathbf{x}_j + b_L)]$ is the output of the hidden nodes in response to the input xj. In most cases, the number of hidden neurons is much smaller than the number of training samples, so the output weights can be analytically determined by finding the unique smallest-norm least-squares solution of the linear system (3), $\beta = H^{\dagger} Y$, where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix H, $H^{\dagger} = H^T (H H^T)^{-1}$. To improve the generalization performance and make the solution more robust, a trade-off parameter C is usually added to each diagonal element of $HH^T$. As a result, the output of the ELM classifier is obtained as

$$f(\mathbf{x}_j) = h(\mathbf{x}_j)\,\beta = h(\mathbf{x}_j)\, H^T \left( \frac{I}{C} + H H^T \right)^{-1} Y \qquad (6)$$

The predicted class label for a given test sample is the index of the output node with the highest output score. Let fi(xj) denote the output function of the ith output node for the input sample xj. Then, the predicted class label of the sample xj is $\text{class}(\mathbf{x}_j) = \arg\max_{1 \le i \le m} f_i(\mathbf{x}_j)$.
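For concreteness, the following is a minimal NumPy sketch of this training and prediction procedure. It is an illustration under stated assumptions (one-hot labels Y, sigmoid activation, weights drawn uniformly from [−1, 1]), not the authors' MATLAB implementation; it solves the L × L form of the regularized system, which is algebraically equivalent to Eq. (6) and cheaper when N > L:

```python
import numpy as np

def elm_train(X, Y, L=1000, C=0.1, rng=None):
    """Train a basic ELM: random hidden layer, ridge-regularized
    least-squares output weights (Eqs. (4)-(6)).
    X: (N, n) descriptors; Y: (N, m) one-hot class labels."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoid hidden layer output
    # beta = (I/C + H^T H)^(-1) H^T Y, equivalent to Eq. (6)
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ Y)
    return W, b, beta

def elm_scores(X, W, b, beta):
    """Per-class output scores f(x_j) = h(x_j) beta."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

def elm_predict(X, W, b, beta):
    """Predicted class = index of the output node with the highest score."""
    return np.argmax(elm_scores(X, W, b, beta), axis=1)
```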

2.2. ELM ensembles (ELM-E)

To overcome known drawbacks of the ELM, such as unstable or non-optimal scores caused by the randomness of the input weights, we rely on ELM ensembles [9,10,19,30,46]. For the aggregation of individual ELMs inside an ensemble we apply our recently proposed Average Score Aggregation strategy [14], which was shown to improve on the commonly used majority voting strategy [9] when the number of ELMs is relatively small. The complete ELM ensemble algorithm with the Average Score Aggregation strategy is given in Algorithm 1. Note that during the testing phase we first sum the output scores obtained from the individual ELMs in the ensemble, and only afterwards compute the class labels. This is the opposite of the commonly used majority voting approach [9], where the class labels are first predicted by each individual ELM in the ensemble and afterwards aggregated into a final decision. With a relatively small number of individual classifiers (K < 15), voting can cause an unnecessary loss of accuracy through "binarization", which completely discards the real-valued output scores of all non-winning classes.
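Reusing the hypothetical `elm_train` and `elm_scores` helpers sketched in Section 2.1, Algorithm 1 reduces to a few lines. Again, this is a sketch of the described procedure, not the authors' code:

```python
import numpy as np

def elm_ensemble_train(X, Y, L=1000, K=10, C=0.1, seed=0):
    """Algorithm 1, training phase: K independently randomized ELMs."""
    return [elm_train(X, Y, L=L, C=C, rng=seed + k) for k in range(K)]

def elm_ensemble_scores(X, ensemble):
    """Average Score Aggregation (Algorithm 1, testing phase): sum the raw
    score matrices of all K ELMs before any argmax, instead of majority
    voting over already "binarized" per-ELM class labels."""
    return sum(elm_scores(X, W, b, beta) for (W, b, beta) in ensemble)

def elm_ensemble_predict(X, ensemble):
    return np.argmax(elm_ensemble_scores(X, ensemble), axis=1)
```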


2.3. The proposed hierarchical ELM ensembles (H-ELM-E)

The concept behind the proposed hierarchical ELM ensembles (H-ELM-E) for descriptor fusion is presented in Fig. 1 and described in the following. In the first level, our method trains several separate ELM ensembles (ELM-E), each using a different type of image descriptor. In the second level, a single ELM ensemble is trained on the output scores obtained from the first-level classifiers.


Fig. 1. Overview of the proposed hierarchical ELM ensembles classification method (H-ELM-E).

This way, during the first stage each ELM ensemble learns how to separate the classes based on a single descriptor, while the second-level ELM ensemble learns an optimal combination of descriptors for a given class.

Let the training set contain N labeled images with D descriptor vectors extracted for each image j: $\mathbf{x}_j^{(1)}, \mathbf{x}_j^{(2)}, \ldots, \mathbf{x}_j^{(D)}$. At the first level, we separately train D basic ELM-E classifiers, one for each descriptor. Let us denote the output classification score vectors of the D separate classifiers as $\mathbf{f}'^{(1)}_j, \mathbf{f}'^{(2)}_j, \ldots, \mathbf{f}'^{(D)}_j$. The dimensionality of every output score vector is equal to the number of classes M into which the images have to be classified. To form the input of the second-level ELM-E classifier, the output scores of the first-level classifiers are concatenated into a mid-level descriptor $\mathbf{f}_j = [\mathbf{f}'^{(1)}_j, \mathbf{f}'^{(2)}_j, \ldots, \mathbf{f}'^{(D)}_j]$, whose dimensionality is D·M. The second-level ELM-E classifier is trained using this mid-level descriptor fj as the input, while sharing the same output labels as in the first level. The input weights of the second-level ELM-E are again independently assigned random values during training. Its output is the final image classification score vector, with the maximum score corresponding to the predicted class label.

One should note that the input weights of all ELM-E classifiers, in both levels of the H-ELM-E network, are randomly and independently assigned according to the ELM theory [24]. Besides that, the same training set is used for all ELM-E classifiers in both levels. In the concrete case, to take advantage of different types of image descriptors, we generate four ELM ensembles in the first level. Each ELM ensemble is trained with an independent descriptor described in the following section. This approach allows each first-level ELM ensemble to learn to classify input images based on a single, possibly "dominant", descriptor type for a given class. To learn the importance of each descriptor within a class, the concatenated output scores are fed to the second-level ELM ensemble, which performs the final prediction of the class label. Note that no preprocessing step (e.g. "z-score" normalization) has to be applied to the heterogeneous input image descriptors, since every input descriptor is first classified separately and no mixing is done at the descriptor level. This is an advantage compared to early descriptor fusion methods, which require appropriate rescaling of the heterogeneous descriptors.
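Under the same assumptions, the two-level scheme of Fig. 1 can be sketched as follows, reusing the hypothetical helpers from Sections 2.1 and 2.2 (for D descriptors and M classes the mid-level descriptor has D·M dimensions):

```python
import numpy as np

def helme_train(X_list, Y, L=1000, K=10, C=0.1):
    """X_list holds D training matrices, one per descriptor type, for the
    same images; Y holds the shared one-hot labels used at both levels."""
    level1 = [elm_ensemble_train(X, Y, L=L, K=K, C=C, seed=(d + 1) * 1000)
              for d, X in enumerate(X_list)]
    # mid-level descriptor: concatenated first-level score vectors (D*M dims)
    F = np.hstack([elm_ensemble_scores(X, ens)
                   for X, ens in zip(X_list, level1)])
    level2 = elm_ensemble_train(F, Y, L=L, K=K, C=C, seed=0)
    return level1, level2

def helme_predict(X_list, level1, level2):
    F = np.hstack([elm_ensemble_scores(X, ens)
                   for X, ens in zip(X_list, level1)])
    return np.argmax(elm_ensemble_scores(F, level2), axis=1)
```

Note that, as stated above, both levels are trained on the same training set, so the second level sees the first level's training scores rather than held-out predictions.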

3. Image descriptors

We now provide a brief overview of the image descriptors used in the context of the proposed classification method. The descriptors were chosen according to computation speed, dimensionality, robustness and effectiveness. We experimented with two texture descriptors (described in 3.1 and 3.2) and two color descriptors (described in 3.3 and 3.4). However, the proposed hierarchical fusion method is generic enough to be used with combinations of other appropriate image descriptors, where an increased number of descriptors is expected to improve classification accuracy at the price of a reasonable increase in complexity.

3.1. Binary Gabor Patterns – BGP

The Binary Gabor Pattern (BGP) has recently been introduced for the texture classification problem [50]. The central idea of the BGP is to combine multiple Gabor filter responses at the same pixel location and encode them in a rotation invariant manner. A 2D Gabor filter at position (x, y) for a given orientation θ, including a real and an imaginary term, can be expressed as:

$$g_{\sigma,\gamma,\lambda}(x, y, \theta) = \exp\left( -\frac{1}{2} \left( \frac{x'^2}{\sigma^2} + \frac{y'^2}{(\gamma\sigma)^2} \right) \right) \exp\left( j\, \frac{2\pi x'}{\lambda} \right) \qquad (7)$$

where x′ = x cos θ + y sin θ, y′ = −x sin θ + y cos θ, σ is the standard deviation of the Gaussian envelope, which specifies the spatial width of the filter, γ is the spatial aspect ratio, which determines the ellipticity of the filter (typically set to 2), and λ is the wavelength of the filter (specified in pixels).

To form a robust local binary descriptor for every pixel location, the image is first convolved with n oriented Gabor filters gσ,γ,λ(x, y, θi) determined by the discrete orientations θi = iπ/n, i = 0, ..., n − 1. All n filter responses at pixel location (x, y) are first binarized and then concatenated to form the pixel's binary representation, denoted BG(x, y, θi).


This n-bit binary representation at pixel (x, y) can be encoded as an integer number, referred to as the rotation-sensitive Binary Gabor Pattern using n orientations: $BGP'(x, y, n) = \sum_{i=0}^{n-1} BG(x, y, \theta_i) \cdot 2^i$. Inspired by [37], it was noted that the bitwise-shifted values of BGP′(x, y, n) represent the same pattern, rotated by a certain angle. Therefore, to achieve a rotation invariant descriptor, denoted BGP(x, y), shifted BGP′ values should be grouped together, as described in [50]. In the case of n = 8 orientations, the initial set of 2^8 = 256 values of BGP′(x, y, 8) is reduced to 36 rotation invariant BGP(x, y) values. For n = 6 there are 14 BGP(x, y) values, while for n = 4 there are 6 BGP(x, y) values. After BGP(x, y) is computed at every pixel location, a global image descriptor is computed as an L1-normalized histogram. Note that for a grayscale image, two BGP descriptors are extracted, for the pair of even and odd Gabor filters (the real and imaginary terms in Eq. (7)), and concatenated to form the global image descriptor.

To exploit additional characteristics of a color image at multiple scales, we used the approach presented in [14]. Concretely, a color image is first converted into the YCbCr color space, and an even and an odd BGP descriptor are extracted for every color channel separately. We use the following parameters: nY = 6 orientations for the Y channel (28-dimensional descriptor), and nCb = nCr = 4 orientations for the Cb and Cr channels (12-dimensional for Cb and 12-dimensional for Cr). This results in a robust 52-dimensional descriptor of a color image. To include details at multiple scales, BGP descriptors are extracted over the original image and two down-scaled images, and concatenated into a final 3 × 52 = 156-dimensional descriptor.
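As an illustration of the pipeline up to the per-channel histogram, here is a NumPy/SciPy sketch. The kernel size and the min-over-circular-shifts grouping are our illustrative assumptions (the paper's exact grouping follows [50]), and the histogram keeps all 2^n bins rather than relabeling them to the compact rotation-invariant set:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(theta, sigma=2.0, gamma=2.0, lam=8.0, size=15):
    """Complex 2D Gabor kernel of Eq. (7); parameter values are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # x'
    yr = -x * np.sin(theta) + y * np.cos(theta)   # y'
    env = np.exp(-0.5 * (xr**2 / sigma**2 + yr**2 / (gamma * sigma)**2))
    return env * np.exp(1j * 2.0 * np.pi * xr / lam)

def bgp_histogram(channel, n=8):
    """Even+odd BGP histograms of one channel, L1-normalized."""
    resp = np.stack([fftconvolve(channel, gabor_kernel(i * np.pi / n),
                                 mode='same') for i in range(n)])
    hists = []
    for part in (resp.real, resp.imag):              # even / odd filters
        bits = (part > 0).astype(np.int64)           # binarize n responses
        codes = np.tensordot(2 ** np.arange(n), bits, axes=1)  # BGP'(x, y, n)
        # group circularly shifted codes (min over rotations) -> BGP(x, y)
        mask = 2 ** n - 1
        ri = np.min([((codes << k) | (codes >> (n - k))) & mask
                     for k in range(n)], axis=0)
        hist = np.bincount(ri.ravel(), minlength=2 ** n).astype(float)
        hists.append(hist / hist.sum())              # L1 normalization
    return np.concatenate(hists)
```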

3.2. Local Binary Patterns – LBP

The Local Binary Pattern (LBP) is a popular visual descriptor computed using the LBP operator, which captures the local appearance around a pixel. It was introduced in [36] for the texture classification problem, and extended to general neighborhood sizes and rotation invariance in [37]. Since then, LBP has been extended and applied to a variety of applications [25,51]. The local LBP descriptor centered at pixel fc is an array of 8 bits, with one bit encoding each of the pixels in the 3 × 3 neighborhood. Each neighbor bit is set to "1" or "0", depending on whether the intensity of the corresponding pixel is greater than the intensity of the central pixel. To form the binary array, the neighbors are scanned in anti-clockwise order, starting from the rightmost one. The binary array is then converted to a decimal number, representing the LBP value of the central pixel. If we denote the nearest neighbors of the central pixel fc as fi, i = 0, ..., 7, the LBP descriptor can be computed as:

$$LBP_c = \sum_{i=0}^{7} S(f_i - f_c)\, 2^i, \quad \text{where } S(f_i - f_c) = \begin{cases} 1 & \text{if } f_i > f_c \\ 0 & \text{if } f_i \le f_c \end{cases} \qquad (8)$$

An LBP descriptor of the complete image is formed as a histogram of the LBP values computed for every pixel of the image. Although there are 2^8 = 256 possible basic LBP patterns, they can be reduced to a smaller set of 58 rotation invariant patterns, as proposed in [37]. To form the final multi-channel LBP descriptor, which exploits the color information of the image, we extracted the LBP descriptor over all color channels and concatenated them into a single vector. Since the default RGB color space shows large correlation among its channels, we used the YCbCr color space, which can be more effective for image classification.
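A compact sketch of the basic per-channel computation of Eq. (8) follows; it is illustrative, keeps the full 256-bin histogram, and leaves out the 58-pattern reduction of [37]:

```python
import numpy as np

def lbp_histogram(channel):
    """Basic 256-bin LBP histogram of one channel, per Eq. (8)."""
    f = channel.astype(np.float64)
    c = f[1:-1, 1:-1]                       # central pixels f_c
    # the 8 neighbors f_i, scanned anti-clockwise starting from the right
    nbrs = [f[1:-1, 2:], f[:-2, 2:], f[:-2, 1:-1], f[:-2, :-2],
            f[1:-1, :-2], f[2:, :-2], f[2:, 1:-1], f[2:, 2:]]
    codes = np.zeros_like(c, dtype=np.int64)
    for i, fi in enumerate(nbrs):
        codes += (fi > c).astype(np.int64) << i   # S(f_i - f_c) * 2^i
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# multi-channel LBP: concatenate per-channel histograms in YCbCr space
# lbp = np.concatenate([lbp_histogram(ycbcr[..., k]) for k in range(3)])
```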

3.3. Color Layout Descriptor – CLD

The Color Layout Descriptor (CLD) has been designed to compactly represent the spatial color layout of an image [35,41]. It is obtained by extracting local representative colors over non-overlapping image blocks and compressing them using a 2D Discrete Cosine Transform (2D-DCT). The descriptor is characterized by efficient extraction, compact representation and invariance to resolution changes. Several studies have shown its effectiveness for image retrieval and classification [16]. We extract the CLD in a slightly modified way, directly on the YCbCr color channels, without any quantization step at the end. For every color channel c ∈ {Y, Cb, Cr}, the extraction process starts with a spatial partitioning step, where the color channel is divided into 8 × 8 = 64 non-overlapping blocks, to guarantee resolution invariance. Then, a single representative value bc(i, j) is computed by simple averaging of the pixels inside block (i, j), i = 0, 1, ..., 7; j = 0, 1, ..., 7, which provides sufficient accuracy at minimal computational cost. After this, the 64 block representative values bc(i, j) of channel c are passed to a 2D-DCT. Let us denote by Fc(u, v) a DCT coefficient over the color channel c:

$$F_c(u, v) = C(u)\,C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} b_c(i, j)\, \cos\frac{(2i + 1)u\pi}{16}\, \cos\frac{(2j + 1)v\pi}{16} \qquad (9)$$

$$C(u) = \begin{cases} 1/\sqrt{8}, & u = 0 \\ 1/2, & u \ne 0 \end{cases} \qquad C(v) = \begin{cases} 1/\sqrt{8}, & v = 0 \\ 1/2, & v \ne 0 \end{cases}$$
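The per-channel extraction of Eq. (9), together with the zig-zag coefficient selection described next, can be sketched as follows; `scipy.fft.dctn` with orthonormal normalization matches the C(u), C(v) factors above, and `n_coeff` is an assumption (10 for Y, 6 for Cb/Cr in our configuration):

```python
import numpy as np
from scipy.fft import dctn

# zig-zag scan order of an 8x8 coefficient matrix; its first entries
# reproduce the list F(0,0), F(0,1), F(1,0), F(2,0), F(1,1), F(0,2), ...
ZIGZAG = sorted(((u, v) for u in range(8) for v in range(8)),
                key=lambda t: (t[0] + t[1],
                               t[0] if (t[0] + t[1]) % 2 else t[1]))

def cld_channel(channel, n_coeff):
    """CLD of one color channel: 8x8 block means -> 2D-DCT (Eq. (9))
    -> first n_coeff zig-zag coefficients."""
    h, w = channel.shape
    bh, bw = h // 8, w // 8
    blocks = channel[:bh * 8, :bw * 8].reshape(8, bh, 8, bw)
    b = blocks.mean(axis=(1, 3))        # representative block values b_c(i, j)
    F = dctn(b, norm='ortho')           # orthonormal 2D-DCT matches C(u), C(v)
    return np.array([F[u, v] for u, v in ZIGZAG[:n_coeff]])

# 22-dimensional CLD of a YCbCr image (10 + 6 + 6 coefficients):
# cld = np.concatenate([cld_channel(Y, 10), cld_channel(Cb, 6), cld_channel(Cr, 6)])
```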

The resulting DCT coefficients Fc(u, v) are reordered in accordance with the zig-zag pattern. Finally, the most informative DCT coefficients are retained for every color channel c: CLDc = [Fc(0, 0), Fc(0, 1), Fc(1, 0), Fc(2, 0), Fc(1, 1), Fc(0, 2), Fc(0, 3), Fc(1, 2), Fc(2, 1), Fc(3, 0), ...]. The final CLD descriptor is formed by concatenating the CLDs extracted for every color channel: CLD = [CLDY, CLDCb, CLDCr]. It was empirically shown that 6 to 10 dominant coefficients per channel are sufficient for image classification, so we use a 22-dimensional CLD representation (10 coefficients for the Y channel, 6 for Cb, and 6 for Cr). It should be noted that the CLD is scale invariant due to its block-based processing, so multi-scale extraction, as in the case of the Gabor filters, is not needed.

3.4. RGB histogram – RGB

This is a baseline color descriptor representing the global color distribution in the RGB (Red, Green, Blue) color space [43,51]. The RGB histogram is constructed by concatenating the three normalized histograms of the R, G and B channels. We used 16 bins for the histogram of each channel, and concatenated them into a final 48-dimensional representation. Experiments with an increased number of bins demonstrated no gain in accuracy.

4. Experimental evaluation

4.1. Datasets

To evaluate the performance of the proposed approach, we carried out a number of experiments using two publicly available scene datasets. We first considered the recently introduced Landuse21 dataset [49]. It consists of 2100 aerial images extracted from high-resolution aerial imagery and labeled as belonging to 21 land use classes. For each of the 21 classes, 100 images are available at a resolution of 256 × 256 pixels, with a large variation in terms of texture and color. An example image for each of the 21 classes is shown in Fig. 2. In recent years, many researchers have used this dataset, allowing for an extensive comparison of results with the literature.


Fig. 2. An example image for each of the classes from the test datasets: a) Landuse21 dataset [49], b) 8-scenes dataset [38].

Table 1. Classification accuracy (in %) of the proposed fusion method H-ELM-E, compared to results obtained with separate descriptors and with other fusion approaches. Results are given for both test datasets (Landuse21 and 8-scenes). Standard deviation is given in brackets.

| Descriptor | Classifier | Accuracy, Landuse21 | Accuracy, 8-scenes |
| --- | --- | --- | --- |
| BGP | ELM-E (1-level) | 82.64 (±1.65) | 79.65 (±0.59) |
| LBP | ELM-E (1-level) | 83.42 (±1.76) | 80.34 (±0.73) |
| CLD | ELM-E (1-level) | 48.34 (±1.89) | 56.42 (±1.02) |
| RGB | ELM-E (1-level) | 68.54 (±1.52) | 47.62 (±0.86) |
| BGP+LBP+CLD+RGB (early fusion) | ELM-E (1-level), L = 1000, K = 10 | 88.41 (±1.54) | 83.52 (±0.43) |
| BGP+LBP+CLD+RGB (early fusion, PCA to 100 dim) | ELM-E (1-level), L = 1000, K = 10 | 83.12 (±1.59) | 81.43 (±0.62) |
| BGP, LBP, CLD, RGB (late fusion) | H-ELM-E (2-levels), L = 1000, K = 10 | 90.54 (±1.43) | 85.63 (±0.57) |
| BGP, LBP, CLD, RGB (late fusion) | H-ELM-E (2-levels), L = 2000, K = 10 | 91.45 (±1.25) | 86.62 (±1.12) |

We additionally performed tests on the challenging 8-scenes dataset [38]. The dataset consists of 2688 color images of outdoor scenes classified into 8 categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. Fig. 2 shows example images from the two datasets.

To make our tests comparable to other published results, we randomly partitioned each class into training and testing subsets, as is common in the literature. By default, in the case of the Landuse21 dataset, we randomly partitioned each class into 80% training and 20% testing images. For the 8-scenes dataset we used 100 images per class for training, and the rest of the images for testing. The experiments were repeated 50 times using different random dataset partitions to ensure robust results. The proposed method was implemented in MATLAB, and all experiments were carried out on a mainstream computer with an Intel Core i7 3.2 GHz CPU. In the following subsections we give the results of the experiments on both datasets.

4.2. Results

The first set of experiments was conducted to determine the optimal parameter settings for ELM-E as the basic classifier unit in the proposed H-ELM-E classification scheme. We measured the classification accuracy for different ELM-E parameter configurations, varying the number of hidden neurons (L) per ELM in an ELM-E and the number of ELMs in an ensemble (K). For all the following tests, the ELM-E parameter C was fixed to C = 0.1. The results presented in Figs. 3 and 4 demonstrate that an increased number of neurons (L), as well as the combination of several ELMs in an ensemble (K > 1), can significantly improve the accuracy. This stresses the importance of using an ELM-E instead of a single ELM as the basic classifier unit in the proposed hierarchical fusion method. Additionally, a single ELM can exhibit high instability of results caused by the randomness of the input weights, while an ELM-E is a much more stable classifier. Significant improvements are already noticeable at L = 1000 and K = 5.

While further increasing L and K can improve the results, there is a noticeable saturation at L = 1000 and K = 10, after which the improvements are negligible. Therefore, for the further tests we use these parameter values, as an optimal compromise between complexity and accuracy.

To test the impact of multiple feature fusion using the proposed hierarchical method (H-ELM-E), we compared it to the classification accuracy obtained with separate descriptors, as well as with the commonly used early fusion of descriptors (i.e. descriptor concatenation). For this test we fixed the number of ELMs in an ensemble to K = 10. The results presented in Table 1 imply that classification accuracy can be significantly improved by integrating multiple complementary descriptors with an appropriate fusion strategy. An important result is that the proposed H-ELM-E method improves on the accuracy of the commonly used early fusion strategy by more than 3%.

We performed an additional experiment to verify the impact of the PCA technique for dimensionality reduction in the process of early fusion (Table 1). The typical reason for applying PCA is that not all classifiers handle high-dimensional data and correlated variables well. However, neural networks (e.g. ELM ensembles) have proven successful in dealing with high-dimensional and correlated image descriptors [1,29], where it is typically left to the network to simultaneously learn the parameters of the hidden units along with how to combine them into the final decision score. Note that in all experiments with early fusion, we normalize the concatenated descriptor vector to zero mean and unit variance ("z-score") before applying the classifier. This has the effect of averaging out the noise, by penalizing large values which could have a disproportionate impact on classification. Our experimental results verify that, for this particular task, the accuracy is reduced after applying PCA over the concatenated descriptors (Table 1). At the same time, the reduced descriptor dimensionality leads to improved computational efficiency.
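For reference, the early-fusion baseline of Table 1 amounts to the following sketch (our illustration, not the authors' code; `descriptors` is assumed to be a list of per-descriptor matrices for the same images):

```python
import numpy as np

def early_fusion_train(descriptors, n_pca=None):
    """Concatenate descriptors (e.g. BGP|LBP|CLD|RGB), z-score normalize,
    and optionally reduce to n_pca dimensions with PCA. The resulting Z
    is then fed to a single ELM-E classifier."""
    X = np.hstack(descriptors)                       # (N, sum of dims)
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12   # "z-score" statistics
    Z = (X - mu) / sd
    P = None
    if n_pca is not None:
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # principal axes
        P = Vt[:n_pca].T
        Z = Z @ P
    return Z, (mu, sd, P)    # reuse (mu, sd, P) to transform test data
```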


Fig. 3. Classification accuracy of the H-ELM-E on the Landuse21 dataset, depending on the number of hidden neurons (L) per ELM and the number of ELMs (K) in the ensemble ELM-E.

Fig. 4. Classification accuracy of the H-ELM-E on the 8-scenes dataset, depending on the number of hidden neurons (L) per ELM and the number of ELMs (K) in the ensemble ELM-E.

Next, the accuracy of the proposed H-ELM-E method is compared to relevant results published in the literature. Additional tests measuring the influence of the training sample size on the accuracy are also included in Tables 2 and 3. For the Landuse21 dataset, the training size is varied over 80, 50 and 20 images per class, while for the 8-scenes dataset, the training sample size is varied over 200, 100 and 50 samples per class. Tables 1–3 show the generally high accuracy of our method (H-ELM-E).

Finally, we measured and compared the time performance of the proposed method, implemented in MATLAB, on a mainstream Intel Core i7 3.2 GHz. We measured the total training time for 1680 training images and the total testing time for 420 images of the Landuse21 dataset. Table 4 gives the time efficiency results of the proposed late fusion method (H-ELM-E), as well as of an early fusion method (ELM-E). The results in Table 4 demonstrate the high efficiency of H-ELM-E, which is highly stable with respect to the number of hidden neurons and the parameter C = 0.1, as long as the number of hidden neurons is large enough.

Table 2. Comparing the accuracy of the proposed method (H-ELM-E) to state-of-the-art results on the Landuse21 dataset, depending on the number of training images per class.

| Method | 80 | 50 | 20 |
| --- | --- | --- | --- |
| SPM [31] | 74.00 | – | – |
| SPCK++ [49] | 77.38 | – | – |
| MCBGP [14] | 86.52 | 82.11 | 70.29 |
| MCMI [39] | 88.20 | – | – |
| mCENTRIST [41] | 89.90 | – | – |
| H-ELM-E (L = 1000, K = 10) | 90.54 | 86.76 | 75.93 |
| H-ELM-E (L = 2000, K = 10) | 91.45 | 86.81 | 76.15 |

Table 3. Comparing the accuracy of the proposed method (H-ELM-E) to state-of-the-art results on the 8-scenes dataset, depending on the number of training images per class.

| Method | 200 | 100 | 50 |
| --- | --- | --- | --- |
| LCVBP [32] | – | 76.00 | – |
| PM [28] | – | 82.00 | – |
| MCBGP [14] | 84.87 | 82.31 | 78.97 |
| CENTRIST [48] | – | 86.22 | – |
| H-ELM-E (L = 1000, K = 10) | 87.89 | 85.63 | 82.53 |
| H-ELM-E (L = 2000, K = 10) | 87.92 | 86.62 | 82.64 |

Table 4. Time efficiency of ELM-E based methods on the Landuse21 dataset, depending on parameters.

| Classifier | Parameters | Training time (s) | Testing time (s) |
| --- | --- | --- | --- |
| ELM-E (early fusion) | L = 1000, K = 10 | 1.3 | 0.12 |
| H-ELM-E (late fusion) | L = 1000, K = 10 | 5.6 | 0.43 |
| H-ELM-E (late fusion) | L = 2000, K = 10 | 17.3 | 0.91 |
| H-ELM-E (late fusion) | L = 2000, K = 20 | 36.2 | 1.68 |

Hence, there is no need for an exhaustive search for the optimal ELM-E parameters during the training phase, which improves the efficiency of the method. In addition, ELM scales efficiently to a large number of samples [23]. As expected, the "early fusion" method achieves better efficiency than the proposed hierarchical method, at the cost of reduced accuracy; when designing a machine learning system, it is always a matter of compromise between efficiency and accuracy. An additional improvement of the H-ELM-E efficiency can be achieved by using a parallel GPU/CPU implementation [1]. Note that parallelization of an ELM-E ensemble is straightforward [20], since it uses multiple instances of the same ELM architecture.

5. Conclusion

We presented a hierarchical method for visual descriptor fusion which reaches highly accurate results on the scene classification task. The experimental evaluation demonstrated the high effectiveness and efficiency of the method. The high quality of the results is mostly due to three factors: 1) multiple complementary image descriptors can successfully capture a variety of image patterns inside a class; 2) ELM ensembles within the proposed hierarchical feature fusion method demonstrate powerful classification capabilities without time-consuming operations; and 3) late fusion of classification scores by the second-level ELM-E is able to learn the importance of each separate descriptor and predict the final class with high accuracy.

While operating in real time, our method achieves accuracy comparable to state-of-the-art results. Although the tests outlined in this paper were conducted using four descriptors in the visible spectrum, the method is generic enough to be extended to other image descriptors and image modalities, such as Near-InfraRed (NIR) images [4]. Although we demonstrated high-level results using only manually engineered image descriptors, we believe that further improvements could be achieved by using descriptors learned in a supervised manner. Therefore, our primary research plan is to investigate the possibilities for integrating deep learned features [29] into our hierarchical classification method.

References

[1] A. Akusok, K.M. Bjork, Y. Miche, A. Lendasse, High-performance extreme learning machines: a complete toolbox for big data applications, IEEE Access 3 (2015) 1011–1025.
[2] P.K. Atrey, M.A. Hossain, A. El Saddik, M.S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimed. Syst. 16 (6) (2010) 345–379.
[3] B. Ayerdi, M. Grana, Hybrid extreme rotation forest, Neural Networks 52 (2014) 33–42.
[4] A. Bosch, A. Zisserman, X. Muñoz, Scene classification using a hybrid generative/discriminative approach, IEEE Trans. Pattern Anal. Mach. Intell. 30 (4) (2008) 712–727.
[5] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[6] G. Brown, J. Wyatt, R. Harris, X. Yao, Diversity creation methods: a survey and categorisation, Inf. Fusion 6 (1) (2005) 5–20.
[7] M. Brown, S. Susstrunk, Multi-spectral SIFT for scene category recognition, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 177–184.
[8] J. Cao, T. Chen, J. Fan, Fast online learning algorithm for landmark recognition based on BoW framework, in: 2014 IEEE 9th Conference on Industrial Electronics and Applications (ICIEA), 2014, pp. 1163–1168.
[9] J. Cao, Z. Lin, G.-B. Huang, N. Liu, Voting based extreme learning machine, Inf. Sci. 185 (1) (2012) 66–77.
[10] J. Cao, C. Tao, F. Jiayuan, Landmark recognition with compact BoW histogram and ensemble ELM, Multimed. Tools Appl. (2015) 1–19.
[11] J. Cao, L. Xiong, Protein sequence classification with improved extreme learning machine algorithms, BioMed Res. Int. (2014) 1–12.
[12] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27:1–27:27.
[13] M. Crosier, L.D. Griffin, Using basic image features for texture classification, Int. J. Comput. Vision 88 (3) (2010) 447–460.
[14] S. Cvetković, M.B. Stojanović, S.V. Nikolić, Multi-channel descriptors and ensemble of Extreme Learning Machines for classification of remote sensing images, Signal Process. Image Commun. 39 (Part A) (2015) 111–120.
[15] P. Du, A. Samat, P. Gamba, X. Xie, Polarimetric SAR image classification by boosted multiple-kernel extreme learning machines with polarimetric and spatial features, Int. J. Remote Sensing 35 (23) (2014) 7978–7990.
[16] H. Eidenberger, Statistical analysis of content-based MPEG-7 descriptors for image retrieval, Multimedia Syst. 10 (2) (2004) 84–97.
[17] H. Ghassemian, A review of remote sensing image fusion methods, Inf. Fusion 32 (Part A) (2016) 75–89.
[18] P.M. Granitto, P.F. Verdes, H.A. Ceccatto, Neural network ensembles: evaluation of aggregation algorithms, Artif. Intell. 163 (2) (2005) 139–162.
[19] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, GPU-accelerated and parallelized ELM ensembles for large-scale regression, Neurocomputing 74 (16) (2011) 2430–2437.
[20] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015) 32–48.
[21] G.-B. Huang, What are extreme learning machines? Filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle, Cognit. Comput. 7 (3) (2015) 263–278.
[22] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892.
[23] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. B 42 (2) (2012) 513–529.
[24] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1-3) (2006) 489–501.
[25] S. ul Hussain, B. Triggs, Visual recognition using local quantized patterns, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 2012, pp. 716–729.
[26] Y. Kaya, L. Kayci, R. Tekin, O.F. Ertugrul, Evaluation of texture features for automatic detecting butterfly species using extreme learning machine, J. Exp. Theor. Artif. Intell. 26 (2) (2014) 267–281.
[27] F.S. Khan, R.M. Anwer, J. van de Weijer, M. Felsberg, J. Laaksonen, Compact color-texture description for texture classification, Pattern Recognit. Lett. 51 (2015) 16–22.
[28] F.S. Khan, J. van de Weijer, S. Ali, M. Felsberg, Evaluating the impact of color on texture recognition, in: R. Wilson, E. Hancock, A. Bors, W. Smith (Eds.), Computer Analysis of Images and Patterns: 15th International Conference, CAIP 2013, York, UK, August 27-29, 2013, Proceedings, Part I, Springer Berlin Heidelberg, 2013, pp. 154–162.
[29] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., pp. 1097–1105.
[30] Y. Lan, Y.C. Soh, G.-B. Huang, Ensemble of online sequential extreme learning machine, Neurocomputing 72 (13-15) (2009) 3391–3395.
[31] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2006, pp. 2169–2178.
[32] S.H. Lee, J.Y. Choi, Y.M. Ro, K.N. Plataniotis, Local color vector binary patterns from multichannel face images for face recognition, IEEE Trans. Image Process. 21 (4) (2012) 2347–2353.
[33] W. Li, C. Chen, H. Su, Q. Du, Local binary patterns and extreme learning machine for hyperspectral imagery classification, IEEE Trans. Geosci. Remote Sensing 53 (7) (2015) 3681–3693.
[34] N. Liu, H. Wang, Ensemble based extreme learning machine, IEEE Signal Process. Lett. 17 (8) (2010) 754–757.
[35] B.S. Manjunath, J.-R. Ohm, V.V. Vasudevan, A. Yamada, Color and texture descriptors, IEEE Trans. Circuits Syst. Video Technol. 11 (6) (2001) 703–715.
[36] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognit. 29 (1) (1996) 51–59.
[37] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[38] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision 42 (3) (2001) 145–175.
[39] J. Ren, X. Jiang, J. Yuan, Learning LBP structure by maximizing the conditional mutual information, Pattern Recognit. 48 (10) (2015) 3180–3190.
[40] R. Rifkin, A. Klautau, In defense of one-vs-all classification, J. Mach. Learn. Res. 5 (2004) 101–141.
[41] P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface.
[42] A. Samat, P. Du, S. Liu, J. Li, L. Cheng, E²LMs: ensemble extreme learning machines for hyperspectral image classification, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sensing 7 (4) (2014) 1060–1069.
[43] K.E.A. van de Sande, T. Gevers, C.G.M. Snoek, Evaluating color descriptors for object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1582–1596.
[44] C.G.M. Snoek, M. Worring, A.W.M. Smeulders, Early versus late fusion in semantic video analysis, in: Proceedings of the 13th Annual ACM International Conference on Multimedia, Singapore, 2005, pp. 399–402.
[45] X. Tian, L. Jiao, X. Liu, X. Zhang, Feature integration of EODH and Color-SIFT: application to image retrieval based on codebook, Signal Process. 29 (4) (2014) 530–545.
[46] X.-L. Wang, Y.-Y. Chen, H. Zhao, B.-L. Lu, Parallelized extreme learning machine ensemble based on min-max modular network, Neurocomputing 128 (2014) 31–41.
[47] M. Wozniak, M. Grana, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.
[48] J. Wu, J.M. Rehg, CENTRIST: a visual descriptor for scene categorization, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1489–1501.
[49] Y. Yang, S. Newsam, Spatial pyramid co-occurrence for image classification, in: Computer Vision (ICCV), 2011 IEEE International Conference on, 2011, pp. 1465–1472.
[50] L. Zhang, Z. Zhou, H. Li, Binary Gabor pattern: an efficient and robust descriptor for texture classification, in: Image Processing (ICIP), 2012 19th IEEE International Conference on, 2012, pp. 81–84.
[51] Y. Zhao, Theories and applications of LBP: a survey, in: D.-S. Huang, Y. Gan, P. Gupta, M.M. Gromiha (Eds.), Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence, Springer Berlin Heidelberg, pp. 112–120.