Information Fusion 41 (2018) 16–24


Hierarchical ELM ensembles for visual descriptor fusion

Stevica Cvetković a,∗, Miloš B. Stojanović b, Saša V. Nikolić a

a Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 14, Niš 18000, Serbia
b College of Applied Technical Sciences Niš, Aleksandra Medvedeva 20, Niš 18000, Serbia

∗ Corresponding author. E-mail address: [email protected] (S. Cvetković).

http://dx.doi.org/10.1016/j.inffus.2017.07.003

Article history: Received 26 December 2016; Revised 9 June 2017; Accepted 27 July 2017; Available online 28 July 2017

Keywords: Feature fusion; Extreme Learning Machine; Hierarchical classifiers; Scene classification

Abstract

Extreme Learning Machines (ELM) have been successfully applied to a variety of classification problems using a single descriptor type. However, a single descriptor may be insufficient for the visual classification task, due to high intra-class variability coupled with low inter-class distance. Although several studies have investigated methods for combining multiple descriptors with ELM, they predominantly apply a simple concatenation of descriptors before classification. This type of descriptor fusion may impose problems of descriptor compatibility, high dimensionality and restricted accuracy. In this paper, we propose a hierarchical descriptor fusion strategy at the decision level ("late fusion"), which relies on ELM ensembles (ELM-E). The proposed method, denoted H-ELM-E, effectively combines multiple complementary descriptors in a two-level ELM-E based architecture, which ensures that more informative descriptors gain more impact on the final decision. In the first level, a separate ELM-E classifier is trained for every image descriptor. In the second level, the output scores from the previous level are aggregated into a mid-level representation, which is fed to an additional ELM-E classifier. An exhaustive experimental evaluation confirmed that the proposed hierarchical ELM-E based strategy is superior to single-descriptor methods, as well as to "early fusion" of multiple descriptors, for the visual classification task. Additionally, it was shown that a significant accuracy improvement is achieved by integrating ensembles of ELM as the basic classifier, instead of using a single ELM. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, there have been great advances in natural scene image processing. Research has focused both on low-level tasks, such as denoising or segmentation, and on high-level ones, such as detection or classification. A variety of algorithms have been developed for classification at the pixel level; however, the problem becomes more complex at the level of complete scene classification. The goal of scene classification is to label an image according to a set of predefined semantic categories (e.g. forest, river, mountain, desert, etc.). It is a challenging problem because of the large variability within a given class in terms of content, color, scale and orientation. High intra-class variability can be coupled with low inter-class distance, a problem that grows as finer classification is required.

Research on natural scene classification has focused both on the use of suitable image descriptors and on appropriate classification algorithms. A variety of image texture descriptors have been proposed in the literature [13,37,50] and applied to scene classification. To make these descriptors more robust, it was found necessary to include additional visual cues, such as color information, which has been employed to improve the performance of scene classification algorithms due to the complementary characteristics among the color channels [43]. Although there is an increasing amount of work on combining texture and color descriptors [4,7,17,27,45], effective fusion of descriptors by assessing their complementarity is still an open research problem in computer vision. This motivated us to explore complementary visual information in order to boost scene classification performance.

To make image descriptors more robust, we found it necessary to simultaneously include multiple visual cues (i.e. texture, color, etc.) using an appropriate fusion strategy. The fusion process can occur at the descriptor level or at the decision level [2,44]. While descriptor-level fusion (i.e. "early fusion") integrates heterogeneous descriptors into a single vector, decision-level fusion ("late fusion") operates on the output classification scores of each individual descriptor and combines them into a final decision. Despite its simplicity and computational efficiency, the early fusion approach may impose problems of descriptor compatibility, high dimensionality and restricted accuracy. The basic approach to late fusion is to use a fixed weight for each classifier score and then compute a weighted sum of the scores as the final result. This assumes that all classifiers share the same weight and cannot account for differences in the classifiers' individual prediction capability.


Therefore, in this work, we focus on late fusion of descriptors, where an additional classifier is trained to estimate the specific fusion weights for each separate descriptor. The proposed method is investigated in the context of the scene classification task.

As the basic classifier, we consider single hidden layer feedforward neural networks (SLFN), an alternative to the commonly used SVM [12]. Concretely, we investigate a recently introduced SLFN training algorithm termed the Extreme Learning Machine (ELM) [20,24]. The choice of the ELM classifier is due to its extremely efficient training procedure and highly accurate classification performance. The main drawback of traditional artificial neural networks and SVM is their training speed, which has been a major issue for practical applications, especially when real-time output of the system is needed. The ELM drastically increases the training speed of SLFNs by randomly generating the input weights and biases of the hidden layer nodes, instead of iteratively adjusting these parameters with the commonly used gradient-based methods. The output weights of the hidden layer are then analytically computed by a least squares method. Besides minimizing the training error, the ELM finds the smallest norm of the output weights and hence tends to give better generalization performance than gradient-based learning algorithms, such as backpropagation. Moreover, the ELM can naturally handle the multi-class classification problem with an architecture whose number of output nodes equals the number of pattern classes. This is an advantage compared to the widely used SVM method, which applies a one-versus-all or one-versus-one strategy to handle non-binary cases [40]. It is therefore highly beneficial to study possibilities for ELM integration into a heterogeneous descriptor fusion scheme, as presented in this work.

The ELM has already been applied to a variety of classification-related problems including texture classification [26], protein sequence classification [11], remote sensing image classification [15,33] and landmark recognition [8,10]. Compared to existing machine learning techniques, the ELM is conceptually simpler and computationally more efficient while demonstrating high generalization capabilities. However, the random assignment of parameters introduces suboptimal input weights and biases into the hidden layer, which may result in unstable and non-optimal output. A natural way to overcome this drawback is to use an ensemble of ELMs, following the established principles of randomized learners such as Random Forest [5]. Several algorithms for the formation of ELM ensembles were recently proposed [9,34,42], including our Average Score Aggregation [14]. The main advantage of ensembles comes from the fact that the combined outputs of several diverse learners can exceed the generalization capability of any single classifier used in the ensemble [18]. To further improve diversity, several learner-independent techniques, such as resampling, label switching and feature space partitioning, can be applied [3].

Inspired by the two previous trends of descriptor fusion and ELM ensembles, we propose to couple them in a way that allows ELMs to directly select, from a set of descriptor candidates, those descriptors that best discriminate the target classes. Our approach to descriptor fusion is hierarchical: we propose a two-level ELM-based architecture which ensures that more informative descriptors gain more significance in the final decision.
In the first level, a separate ELM classifier is trained for every image descriptor. Then, in the second level, the output scores returned by the first-level classifiers are aggregated to obtain a mid-level representation. The mid-level descriptor is then used as the input of the second-level ELM classifier, which produces the final classification result. In this way the second-level classifier can directly favor those descriptors that best discriminate the target classes. To further improve the accuracy of the method, we propose to integrate ensembles of ELM as the basic classifier, instead of a single ELM.


Ensembles of ELM have been shown to improve classification accuracy substantially without significant additional computation [9,14]. In this work, we successfully integrate ELM ensembles into the proposed hierarchical ELM architecture for the scene classification task.

We consider the main contribution of the paper to be the introduction of a novel descriptor fusion method that effectively tackles image intra-class diversity through a hierarchical ELM-based approach. Apart from the theoretical contribution, we performed an extensive evaluation over two public scene datasets, which showed that the proposed algorithm can reach highly accurate results without computationally complex operations. A comparative evaluation demonstrates the increased classification accuracy of the proposed H-ELM-E method compared to the accuracy obtained with separate descriptors, as well as to early fusion of descriptors (i.e. descriptor concatenation). In addition, the experiments demonstrate the high computational efficiency of the complete scene classification pipeline.

The remainder of the paper is organized as follows. Section 2 gives a brief overview of ELM and ensembles of ELMs for multi-class classification, and then introduces the proposed method for hierarchical descriptor fusion which relies on ELM ensembles. Section 3 describes the extraction of the visual descriptors used in the proposed classification scheme. Experimental results and discussion are presented in Section 4, while Section 5 draws conclusions and proposes ideas for future work.

2. Hierarchical fusion of Extreme Learning Machines (ELM)

Fusion of classifiers aims to combine mutually complementary individual classifiers characterized by high diversity and accuracy [47]. It is intuitive that increasing diversity should lead to better accuracy of the combined classifier, although there is no formal proof of this dependency. Brown et al. [6] observed that diversity can be successfully ensured by independently generating individual classifiers based on random techniques. The advantage of using ELMs in the fusion is that their diversity comes naturally from the randomness in the hidden layer of neurons. An additional increase in the diversity of the proposed hierarchical method is provided by integrating an ensemble of ELMs as the basic classifier, instead of using a single ELM. We will first give a brief overview of ELM and ensembles of ELMs, and afterwards describe the proposed hierarchical ELM-based algorithm for heterogeneous descriptor fusion (H-ELM-E).

2.1. ELM for multiclass classification

Suppose that we have N training samples denoted as (xj, yj), j = 1, ..., N, where xj = [xj1, xj2, ..., xjn]T ∈ Rn represents the j-th training sample of dimension n, and yj = [yj1, yj2, ..., yjm]T ∈ Rm represents the j-th training label of dimension m, where m is the number of classes. In the context of visual feature fusion, xj can be regarded as an image descriptor, while yj is an m-dimensional binary vector of class labels, with value "1" at the position of the corresponding class and value "0" at all other positions. The output of an ELM with L hidden neurons and activation function h(x) is defined as

$$f(\mathbf{x}_j) = \sum_{i=1}^{L} \beta_i\, h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i), \quad j = 1, \ldots, N \qquad (1)$$

where h(·) is a nonlinear piecewise continuous activation function, βi ∈ Rm represents the weight vector connecting the ith hidden neuron and all the output neurons, wi ∈ Rn is the weight vector connecting the ith hidden neuron and all input neurons, and bi is the threshold of the ith hidden neuron.


Algorithm 1. ELM ensemble using Average Score Aggregation [14].

Given: the number of classes m, the sigmoid activation function h(x), the number of hidden neurons L, and the number of ELMs in an ensemble K (every ELM uses the same values of m, h(x), L and K).

Training
Input: a training set S = {(xj, yj) | xj ∈ Rn, yj ∈ Rm, j = 1, ..., N} with N instances of n-dimensional descriptors and m-dimensional output scores.
for k = 1 to K
  1. Generate random input weights wi and biases bi, i = 1, ..., L.
  2. Assign the input weights wi and biases bi to ELM(k).
  3. Calculate the hidden layer output matrix Htrain(k) using the complete S, according to (4).
  4. Compute β(k) = H†train(k) Y using the complete S.
end for

Testing
Input: a test set S′ = {xj | xj ∈ Rn, j = 1, ..., N′} with N′ instances of n-dimensional descriptors.
for k = 1 to K
  1. Calculate the hidden layer output matrix Htest(k) using Eq. (4) for the new instances from the test set S′.
  2. Obtain the output matrix Ytest(k) = Htest(k) β(k) of dimensionality N′ × m.
end for
Sum up all K output matrices: Ytest = Σk=1..K Ytest(k).
For every test instance (i.e. every row of Ytest), compute the class label as the index of the maximal value in the row.

Although the sigmoid activation function is the most commonly used in practical applications, other activation functions can also be applied (Gaussian, wavelet, hyperbolic tangent, etc.) [21]. According to the ELM universal approximation property, an ELM is able to solve any regression problem with a desired accuracy, if it has enough hidden neurons and enough training data to learn the parameters of all the hidden neurons [21,22]. In addition, ELMs can be easily adapted to classification problems [23] by predicting the class label as the index of the output node with the highest score. ELM theory also proves the classification capability of wide classes of networks with random hidden neurons: it was proven that if tuning the parameters of hidden neurons could make an ELM approximate any target continuous function, then an ELM with a random hidden layer mapping can separate arbitrary disjoint regions of any shape [23]. In the context of ELM theory, wi and bi can be randomly and independently assigned a priori, without considering the input data [24]. An SLFN defined by (1) can approximate the N training samples with zero error, which means $\sum_{j=1}^{N} \| f(\mathbf{x}_j) - \mathbf{y}_j \| = 0$, i.e. there exist βi, wi and bi such that

$$\sum_{i=1}^{L} \beta_i\, h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{y}_j, \quad j = 1, \ldots, N \qquad (2)$$

If a value of 1 is padded to xj to make it an (n + 1)-dimensional vector, the bias can be considered as an element of the weight vector. The equivalent compact matrix form of (2) for N input samples can be written as

$$H\beta = Y \qquad (3)$$

where H represents the hidden layer output matrix of the complete neural network, with the ith column of H representing the ith hidden neuron's output vector with respect to the inputs x1, x2, ..., xN:

$$H = \begin{bmatrix} h(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & h(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ h(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & h(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L} = \begin{bmatrix} h(\mathbf{x}_1) \\ \vdots \\ h(\mathbf{x}_N) \end{bmatrix} \qquad (4)$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad Y = \begin{bmatrix} \mathbf{y}_1^T \\ \vdots \\ \mathbf{y}_N^T \end{bmatrix}_{N \times m} \qquad (5)$$

In (4), $h(\mathbf{x}_j) = [h(\mathbf{w}_1 \cdot \mathbf{x}_j + b_1), \ldots, h(\mathbf{w}_L \cdot \mathbf{x}_j + b_L)]$ is the output of the hidden nodes in response to the input xj. In most cases, the number of hidden neurons is much smaller than the number of training samples, so the output weights can be analytically determined by finding the unique smallest-norm least-squares solution of the linear system (3), $\beta = H^{\dagger} Y$, where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix H, $H^{\dagger} = H^T (H H^T)^{-1}$. To improve the generalization performance and make the solution more robust, a trade-off parameter C is usually added to each diagonal element of $HH^T$. As a result, the output of the ELM classifier is obtained as

$$f(\mathbf{x}_j) = h(\mathbf{x}_j)\,\beta = h(\mathbf{x}_j)\, H^T \left( \frac{I}{C} + H H^T \right)^{-1} Y \qquad (6)$$

The predicted class label for a given test sample is the index of the output node with the highest output score. Let fi(xj) denote the output function of the ith output node for the input sample xj. Then, the predicted class label of the sample xj is $\text{class}(\mathbf{x}_j) = \arg\max_{1 \le i \le m} f_i(\mathbf{x}_j)$.
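For concreteness, the following is a minimal NumPy sketch of this training and prediction procedure. It is an illustration under stated assumptions (one-hot labels Y, sigmoid activation, weights drawn uniformly from [−1, 1]), not the authors' MATLAB implementation; it solves the L × L form of the regularized system, which is algebraically equivalent to Eq. (6) and cheaper when N > L:

```python
import numpy as np

def elm_train(X, Y, L=1000, C=0.1, rng=None):
    """Train a basic ELM: random hidden layer, ridge-regularized
    least-squares output weights (Eqs. (4)-(6)).
    X: (N, n) descriptors; Y: (N, m) one-hot class labels."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoid hidden layer output
    # beta = (I/C + H^T H)^(-1) H^T Y, equivalent to Eq. (6)
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ Y)
    return W, b, beta

def elm_scores(X, W, b, beta):
    """Per-class output scores f(x_j) = h(x_j) beta."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

def elm_predict(X, W, b, beta):
    """Predicted class = index of the output node with the highest score."""
    return np.argmax(elm_scores(X, W, b, beta), axis=1)
```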

2.2. ELM ensembles (ELM-E)

To overcome known drawbacks of the ELM, such as unstable or non-optimal scores caused by the randomness of the input weights, we rely on ELM ensembles [9,10,19,30,46]. For the aggregation of individual ELMs inside an ensemble we apply our recently proposed Average Score Aggregation strategy [14], which was shown to improve on the commonly used majority voting strategy [9] when the number of ELMs is relatively small. The complete ELM ensemble algorithm with the Average Score Aggregation strategy is given in Algorithm 1. Note that during the testing phase we first sum the output scores obtained from the individual ELMs in the ensemble, and only afterwards compute the class labels. This is the opposite of the commonly used majority voting approach [9], where the class labels are first predicted by each individual ELM in the ensemble and afterwards aggregated into a final decision. With a relatively small number of individual classifiers (K < 15), voting can cause an unnecessary loss of accuracy through "binarization", which completely discards the real-valued output scores of all non-winning classes.
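Reusing the hypothetical `elm_train` and `elm_scores` helpers sketched in Section 2.1, Algorithm 1 reduces to a few lines. Again, this is a sketch of the described procedure, not the authors' code:

```python
import numpy as np

def elm_ensemble_train(X, Y, L=1000, K=10, C=0.1, seed=0):
    """Algorithm 1, training phase: K independently randomized ELMs."""
    return [elm_train(X, Y, L=L, C=C, rng=seed + k) for k in range(K)]

def elm_ensemble_scores(X, ensemble):
    """Average Score Aggregation (Algorithm 1, testing phase): sum the raw
    score matrices of all K ELMs before any argmax, instead of majority
    voting over already "binarized" per-ELM class labels."""
    return sum(elm_scores(X, W, b, beta) for (W, b, beta) in ensemble)

def elm_ensemble_predict(X, ensemble):
    return np.argmax(elm_ensemble_scores(X, ensemble), axis=1)
```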


2.3. The proposed hierarchical ELM ensembles (H-ELM-E)

The concept behind the proposed hierarchical ELM ensembles (H-ELM-E) for descriptor fusion is presented in Fig. 1 and described in the following. In the first level, our method trains several separate ELM ensembles (ELM-E), each using a different type of image descriptor. In the second level, a single ELM ensemble is trained on the output scores obtained from the first-level classifiers.


Fig. 1. Overview of the proposed hierarchical ELM ensembles classification method (H-ELM-E).

This way, during the first stage each ELM ensemble learns how to separate the classes based on a single descriptor, while the second-level ELM ensemble learns an optimal combination of descriptors for a given class.

Let the training set contain N labeled images with D descriptor vectors extracted for each image j: $\mathbf{x}_j^{(1)}, \mathbf{x}_j^{(2)}, \ldots, \mathbf{x}_j^{(D)}$. At the first level, we separately train D basic ELM-E classifiers, one for each descriptor. Let us denote the output classification score vectors of the D separate classifiers as $\mathbf{f}'^{(1)}_j, \mathbf{f}'^{(2)}_j, \ldots, \mathbf{f}'^{(D)}_j$. The dimensionality of every output score vector is equal to the number of classes M into which the images have to be classified. To form the input of the second-level ELM-E classifier, the output scores of the first-level classifiers are concatenated into a mid-level descriptor $\mathbf{f}_j = [\mathbf{f}'^{(1)}_j, \mathbf{f}'^{(2)}_j, \ldots, \mathbf{f}'^{(D)}_j]$, whose dimensionality is D·M. The second-level ELM-E classifier is trained using this mid-level descriptor fj as the input, while sharing the same output labels as in the first level. The input weights of the second-level ELM-E are again independently assigned random values during training. Its output is the final image classification score vector, with the maximum score corresponding to the predicted class label.

One should note that the input weights of all ELM-E classifiers, in both levels of the H-ELM-E network, are randomly and independently assigned according to the ELM theory [24]. Besides that, the same training set is used for all ELM-E classifiers in both levels. In the concrete case, to take advantage of different types of image descriptors, we generate four ELM ensembles in the first level. Each ELM ensemble is trained with an independent descriptor described in the following section. This approach allows each first-level ELM ensemble to learn to classify input images based on a single, possibly "dominant", descriptor type for a given class. To learn the importance of each descriptor within a class, the concatenated output scores are fed to the second-level ELM ensemble, which performs the final prediction of the class label. Note that no preprocessing step (e.g. "z-score" normalization) has to be applied to the heterogeneous input image descriptors, since every input descriptor is first classified separately and no mixing is done at the descriptor level. This is an advantage compared to early descriptor fusion methods, which require appropriate rescaling of the heterogeneous descriptors.
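Under the same assumptions, the two-level scheme of Fig. 1 can be sketched as follows, reusing the hypothetical helpers from Sections 2.1 and 2.2 (for D descriptors and M classes the mid-level descriptor has D·M dimensions):

```python
import numpy as np

def helme_train(X_list, Y, L=1000, K=10, C=0.1):
    """X_list holds D training matrices, one per descriptor type, for the
    same images; Y holds the shared one-hot labels used at both levels."""
    level1 = [elm_ensemble_train(X, Y, L=L, K=K, C=C, seed=(d + 1) * 1000)
              for d, X in enumerate(X_list)]
    # mid-level descriptor: concatenated first-level score vectors (D*M dims)
    F = np.hstack([elm_ensemble_scores(X, ens)
                   for X, ens in zip(X_list, level1)])
    level2 = elm_ensemble_train(F, Y, L=L, K=K, C=C, seed=0)
    return level1, level2

def helme_predict(X_list, level1, level2):
    F = np.hstack([elm_ensemble_scores(X, ens)
                   for X, ens in zip(X_list, level1)])
    return np.argmax(elm_ensemble_scores(F, level2), axis=1)
```

Note that, as stated above, both levels are trained on the same training set, so the second level sees the first level's training scores rather than held-out predictions.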

3. Image descriptors

We now provide a brief overview of the image descriptors used in the context of the proposed classification method. The descriptors were chosen according to computation speed, dimensionality, robustness and effectiveness. We experimented with two texture descriptors (described in 3.1 and 3.2) and two color descriptors (described in 3.3 and 3.4). However, the proposed hierarchical fusion method is generic enough to be used with combinations of other appropriate image descriptors, where an increased number of descriptors is expected to improve classification accuracy at the price of a reasonable increase in complexity.

3.1. Binary Gabor Patterns – BGP

The Binary Gabor Pattern (BGP) has recently been introduced for the texture classification problem [50]. The central idea of the BGP is to combine multiple Gabor filter responses at the same pixel location and encode them in a rotation invariant manner. A 2D Gabor filter at position (x, y) for a given orientation θ, including a real and an imaginary term, can be expressed as:

$$g_{\sigma,\gamma,\lambda}(x, y, \theta) = \exp\left( -\frac{1}{2} \left( \frac{x'^2}{\sigma^2} + \frac{y'^2}{(\gamma\sigma)^2} \right) \right) \exp\left( j\, \frac{2\pi x'}{\lambda} \right) \qquad (7)$$

where x′ = x cos θ + y sin θ, y′ = −x sin θ + y cos θ, σ is the standard deviation of the Gaussian envelope, which specifies the spatial width of the filter, γ is the spatial aspect ratio, which determines the ellipticity of the filter (typically set to 2), and λ is the wavelength of the filter (specified in pixels).

To form a robust local binary descriptor for every pixel location, the image is first convolved with n oriented Gabor filters gσ,γ,λ(x, y, θi) determined by the discrete orientations θi = iπ/n, i = 0, ..., n − 1. All n filter responses at pixel location (x, y) are first binarized and then concatenated to form the pixel's binary representation, denoted BG(x, y, θi).


This n-bit binary representation at pixel (x, y) can be encoded as an integer number, referred to as the rotation-sensitive Binary Gabor Pattern using n orientations: $BGP'(x, y, n) = \sum_{i=0}^{n-1} BG(x, y, \theta_i) \cdot 2^i$. Inspired by [37], it was noted that the bitwise-shifted values of BGP′(x, y, n) represent the same pattern, rotated by a certain angle. Therefore, to achieve a rotation invariant descriptor, denoted BGP(x, y), shifted BGP′ values should be grouped together, as described in [50]. In the case of n = 8 orientations, the initial set of 2^8 = 256 values of BGP′(x, y, 8) is reduced to 36 rotation invariant BGP(x, y) values. For n = 6 there are 14 BGP(x, y) values, while for n = 4 there are 6 BGP(x, y) values. After BGP(x, y) is computed at every pixel location, a global image descriptor is computed as an L1-normalized histogram. Note that for a grayscale image, two BGP descriptors are extracted, for the pair of even and odd Gabor filters (the real and imaginary terms in Eq. (7)), and concatenated to form the global image descriptor.

To exploit additional characteristics of a color image at multiple scales, we used the approach presented in [14]. Concretely, a color image is first converted into the YCbCr color space, and an even and an odd BGP descriptor are extracted for every color channel separately. We use the following parameters: nY = 6 orientations for the Y channel (28-dimensional descriptor), and nCb = nCr = 4 orientations for the Cb and Cr channels (12-dimensional for Cb and 12-dimensional for Cr). This results in a robust 52-dimensional descriptor of a color image. To include details at multiple scales, BGP descriptors are extracted over the original image and two down-scaled images, and concatenated into a final 3 × 52 = 156-dimensional descriptor.
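As an illustration of the pipeline up to the per-channel histogram, here is a NumPy/SciPy sketch. The kernel size and the min-over-circular-shifts grouping are our illustrative assumptions (the paper's exact grouping follows [50]), and the histogram keeps all 2^n bins rather than relabeling them to the compact rotation-invariant set:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(theta, sigma=2.0, gamma=2.0, lam=8.0, size=15):
    """Complex 2D Gabor kernel of Eq. (7); parameter values are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # x'
    yr = -x * np.sin(theta) + y * np.cos(theta)   # y'
    env = np.exp(-0.5 * (xr**2 / sigma**2 + yr**2 / (gamma * sigma)**2))
    return env * np.exp(1j * 2.0 * np.pi * xr / lam)

def bgp_histogram(channel, n=8):
    """Even+odd BGP histograms of one channel, L1-normalized."""
    resp = np.stack([fftconvolve(channel, gabor_kernel(i * np.pi / n),
                                 mode='same') for i in range(n)])
    hists = []
    for part in (resp.real, resp.imag):              # even / odd filters
        bits = (part > 0).astype(np.int64)           # binarize n responses
        codes = np.tensordot(2 ** np.arange(n), bits, axes=1)  # BGP'(x, y, n)
        # group circularly shifted codes (min over rotations) -> BGP(x, y)
        mask = 2 ** n - 1
        ri = np.min([((codes << k) | (codes >> (n - k))) & mask
                     for k in range(n)], axis=0)
        hist = np.bincount(ri.ravel(), minlength=2 ** n).astype(float)
        hists.append(hist / hist.sum())              # L1 normalization
    return np.concatenate(hists)
```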

3.2. Local Binary Patterns – LBP

The Local Binary Pattern (LBP) is a popular visual descriptor computed using the LBP operator, which captures the local appearance around a pixel. It was introduced in [36] for the texture classification problem, and extended to general neighborhood sizes and rotation invariance in [37]. Since then, LBP has been extended and applied to a variety of applications [25,51]. The local LBP descriptor centered at pixel fc is an array of 8 bits, with one bit encoding each of the pixels in the 3 × 3 neighborhood. Each neighbor bit is set to "1" or "0", depending on whether the intensity of the corresponding pixel is greater than the intensity of the central pixel. To form the binary array, the neighbors are scanned in anti-clockwise order, starting from the rightmost one. The binary array is then converted to a decimal number, representing the LBP value of the central pixel. If we denote the nearest neighbors of the central pixel fc as fi, i = 0, ..., 7, the LBP descriptor can be computed as:

$$LBP_c = \sum_{i=0}^{7} S(f_i - f_c)\, 2^i, \quad \text{where } S(f_i - f_c) = \begin{cases} 1 & \text{if } f_i > f_c \\ 0 & \text{if } f_i \le f_c \end{cases} \qquad (8)$$

An LBP descriptor of the complete image is formed as a histogram of the LBP values computed for every pixel of the image. Although there are 2^8 = 256 possible basic LBP patterns, they can be reduced to a smaller set of 58 rotation invariant patterns, as proposed in [37]. To form the final multi-channel LBP descriptor, which exploits the color information of the image, we extracted the LBP descriptor over all color channels and concatenated them into a single vector. Since the default RGB color space shows large correlation among its channels, we used the YCbCr color space, which can be more effective for image classification.
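A compact sketch of the basic per-channel computation of Eq. (8) follows; it is illustrative, keeps the full 256-bin histogram, and leaves out the 58-pattern reduction of [37]:

```python
import numpy as np

def lbp_histogram(channel):
    """Basic 256-bin LBP histogram of one channel, per Eq. (8)."""
    f = channel.astype(np.float64)
    c = f[1:-1, 1:-1]                       # central pixels f_c
    # the 8 neighbors f_i, scanned anti-clockwise starting from the right
    nbrs = [f[1:-1, 2:], f[:-2, 2:], f[:-2, 1:-1], f[:-2, :-2],
            f[1:-1, :-2], f[2:, :-2], f[2:, 1:-1], f[2:, 2:]]
    codes = np.zeros_like(c, dtype=np.int64)
    for i, fi in enumerate(nbrs):
        codes += (fi > c).astype(np.int64) << i   # S(f_i - f_c) * 2^i
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# multi-channel LBP: concatenate per-channel histograms in YCbCr space
# lbp = np.concatenate([lbp_histogram(ycbcr[..., k]) for k in range(3)])
```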

3.3. Color Layout Descriptor – CLD

The Color Layout Descriptor (CLD) has been designed to compactly represent the spatial color layout of an image [35,41]. It is obtained by extracting local representative colors over non-overlapping image blocks and compressing them using a 2D Discrete Cosine Transform (2D-DCT). The descriptor is characterized by efficient extraction, compact representation and invariance to resolution changes. Several studies have shown its effectiveness for image retrieval and classification [16]. We extract the CLD in a slightly modified way, directly on the YCbCr color channels, without any quantization step at the end. For every color channel c ∈ {Y, Cb, Cr}, the extraction process starts with a spatial partitioning step, where the color channel is divided into 8 × 8 = 64 non-overlapping blocks, to guarantee resolution invariance. Then, a single representative value bc(i, j) is computed by simple averaging of the pixels inside block (i, j), i = 0, 1, ..., 7; j = 0, 1, ..., 7, which provides sufficient accuracy at minimal computational cost. After this, the 64 block representative values bc(i, j) of channel c are passed to a 2D-DCT. Let us denote by Fc(u, v) a DCT coefficient over the color channel c:

$$F_c(u, v) = C(u)\,C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} b_c(i, j)\, \cos\frac{(2i + 1)u\pi}{16}\, \cos\frac{(2j + 1)v\pi}{16} \qquad (9)$$

$$C(u) = \begin{cases} 1/\sqrt{8}, & u = 0 \\ 1/2, & u \ne 0 \end{cases} \qquad C(v) = \begin{cases} 1/\sqrt{8}, & v = 0 \\ 1/2, & v \ne 0 \end{cases}$$
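The per-channel extraction of Eq. (9), together with the zig-zag coefficient selection described next, can be sketched as follows; `scipy.fft.dctn` with orthonormal normalization matches the C(u), C(v) factors above, and `n_coeff` is an assumption (10 for Y, 6 for Cb/Cr in our configuration):

```python
import numpy as np
from scipy.fft import dctn

# zig-zag scan order of an 8x8 coefficient matrix; its first entries
# reproduce the list F(0,0), F(0,1), F(1,0), F(2,0), F(1,1), F(0,2), ...
ZIGZAG = sorted(((u, v) for u in range(8) for v in range(8)),
                key=lambda t: (t[0] + t[1],
                               t[0] if (t[0] + t[1]) % 2 else t[1]))

def cld_channel(channel, n_coeff):
    """CLD of one color channel: 8x8 block means -> 2D-DCT (Eq. (9))
    -> first n_coeff zig-zag coefficients."""
    h, w = channel.shape
    bh, bw = h // 8, w // 8
    blocks = channel[:bh * 8, :bw * 8].reshape(8, bh, 8, bw)
    b = blocks.mean(axis=(1, 3))        # representative block values b_c(i, j)
    F = dctn(b, norm='ortho')           # orthonormal 2D-DCT matches C(u), C(v)
    return np.array([F[u, v] for u, v in ZIGZAG[:n_coeff]])

# 22-dimensional CLD of a YCbCr image (10 + 6 + 6 coefficients):
# cld = np.concatenate([cld_channel(Y, 10), cld_channel(Cb, 6), cld_channel(Cr, 6)])
```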

The resulting DCT coefficients Fc(u, v) are reordered in accordance with the zig-zag pattern. Finally, the most informative DCT coefficients are retained for every color channel c: CLDc = [Fc(0, 0), Fc(0, 1), Fc(1, 0), Fc(2, 0), Fc(1, 1), Fc(0, 2), Fc(0, 3), Fc(1, 2), Fc(2, 1), Fc(3, 0), ...]. The final CLD descriptor is formed by concatenating the CLDs extracted for every color channel: CLD = [CLDY, CLDCb, CLDCr]. It was empirically shown that 6 to 10 dominant coefficients per channel are sufficient for image classification, so we use a 22-dimensional CLD representation (10 coefficients for the Y channel, 6 for Cb, and 6 for Cr). It should be noted that the CLD is scale invariant due to its block-based processing, so multi-scale extraction, as in the case of the Gabor filters, is not needed.

3.4. RGB histogram – RGB

This is a baseline color descriptor representing the global color distribution in the RGB (Red, Green, Blue) color space [43,51]. The RGB histogram is constructed by concatenating the three normalized histograms of the R, G and B channels. We used 16 bins for the histogram of each channel, and concatenated them into a final 48-dimensional representation. Experiments with an increased number of bins demonstrated no gain in accuracy.

4. Experimental evaluation

4.1. Datasets

To evaluate the performance of the proposed approach, we carried out a number of experiments using two publicly available scene datasets. We first considered the recently introduced Landuse21 dataset [49]. It consists of 2100 aerial images extracted from high-resolution aerial imagery and labeled as belonging to 21 land use classes. For each of the 21 classes, 100 images are available at a resolution of 256 × 256 pixels, with a large variation in terms of texture and color. An example image for each of the 21 classes is shown in Fig. 2. In recent years, many researchers have used this dataset, allowing for an extensive comparison of results with the literature.


Fig. 2. An example image for each of the classes from the test datasets: a) Landuse21 dataset [49], b) 8-scenes dataset [38].

Table 1. Classification accuracy (in %) of the proposed fusion method H-ELM-E, compared to results obtained with separate descriptors and with other fusion approaches. Results are given for both test datasets (Landuse21 and 8-scenes). Standard deviation is given in brackets.

| Descriptor | Classifier | Accuracy, Landuse21 | Accuracy, 8-scenes |
| --- | --- | --- | --- |
| BGP | ELM-E (1-level) | 82.64 (±1.65) | 79.65 (±0.59) |
| LBP | ELM-E (1-level) | 83.42 (±1.76) | 80.34 (±0.73) |
| CLD | ELM-E (1-level) | 48.34 (±1.89) | 56.42 (±1.02) |
| RGB | ELM-E (1-level) | 68.54 (±1.52) | 47.62 (±0.86) |
| BGP+LBP+CLD+RGB (early fusion) | ELM-E (1-level), L = 1000, K = 10 | 88.41 (±1.54) | 83.52 (±0.43) |
| BGP+LBP+CLD+RGB (early fusion, PCA to 100 dim) | ELM-E (1-level), L = 1000, K = 10 | 83.12 (±1.59) | 81.43 (±0.62) |
| BGP, LBP, CLD, RGB (late fusion) | H-ELM-E (2-levels), L = 1000, K = 10 | 90.54 (±1.43) | 85.63 (±0.57) |
| BGP, LBP, CLD, RGB (late fusion) | H-ELM-E (2-levels), L = 2000, K = 10 | 91.45 (±1.25) | 86.62 (±1.12) |

We additionally performed tests on the challenging 8-scenes dataset [38]. The dataset consists of 2688 color images of outdoor scenes classified into 8 categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. Fig. 2 shows example images from the two datasets.

To make our tests comparable to other published results, we randomly partitioned each class into training and testing subsets, as is common in the literature. By default, in the case of the Landuse21 dataset, we randomly partitioned each class into 80% training and 20% testing images. For the 8-scenes dataset we used 100 images per class for training, and the rest of the images for testing. The experiments were repeated 50 times using different random dataset partitions to ensure robust results. The proposed method was implemented in MATLAB, and all experiments were carried out on a mainstream computer with an Intel Core i7 3.2 GHz CPU. In the following subsections we give the results of the experiments on both datasets.

4.2. Results

The first set of experiments was conducted to determine the optimal parameter settings for ELM-E as the basic classifier unit in the proposed H-ELM-E classification scheme. We measured the classification accuracy for different ELM-E parameter configurations, varying the number of hidden neurons (L) per ELM in an ELM-E and the number of ELMs in an ensemble (K). For all the following tests, the ELM-E parameter C was fixed to C = 0.1. The results presented in Figs. 3 and 4 demonstrate that an increased number of neurons (L), as well as the combination of several ELMs in an ensemble (K > 1), can significantly improve the accuracy. This stresses the importance of using an ELM-E instead of a single ELM as the basic classifier unit in the proposed hierarchical fusion method. Additionally, a single ELM can exhibit high instability of results caused by the randomness of the input weights, while an ELM-E is a much more stable classifier. Significant improvements are already noticeable at L = 1000 and K = 5.

While further increasing L and K can improve the results, there is a noticeable saturation at L = 1000 and K = 10, after which the improvements are negligible. Therefore, for the further tests we use these parameter values, as an optimal compromise between complexity and accuracy.

To test the impact of multiple feature fusion using the proposed hierarchical method (H-ELM-E), we compared it to the classification accuracy obtained with separate descriptors, as well as with the commonly used early fusion of descriptors (i.e. descriptor concatenation). For this test we fixed the number of ELMs in an ensemble to K = 10. The results presented in Table 1 imply that classification accuracy can be significantly improved by integrating multiple complementary descriptors with an appropriate fusion strategy. An important result is that the proposed H-ELM-E method improves on the accuracy of the commonly used early fusion strategy by more than 3%.

We performed an additional experiment to verify the impact of the PCA technique for dimensionality reduction in the process of early fusion (Table 1). The typical reason for applying PCA is that not all classifiers handle high-dimensional data and correlated variables well. However, neural networks (e.g. ELM ensembles) have proven successful in dealing with high-dimensional and correlated image descriptors [1,29], where it is typically left to the network to simultaneously learn the parameters of the hidden units along with how to combine them into the final decision score. Note that in all experiments with early fusion, we normalize the concatenated descriptor vector to zero mean and unit variance ("z-score") before applying the classifier. This has the effect of averaging out the noise, by penalizing large values which could have a disproportionate impact on classification. Our experimental results verify that, for this particular task, the accuracy is reduced after applying PCA over the concatenated descriptors (Table 1). At the same time, the reduced descriptor dimensionality leads to improved computational efficiency.
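For reference, the early-fusion baseline of Table 1 amounts to the following sketch (our illustration, not the authors' code; `descriptors` is assumed to be a list of per-descriptor matrices for the same images):

```python
import numpy as np

def early_fusion_train(descriptors, n_pca=None):
    """Concatenate descriptors (e.g. BGP|LBP|CLD|RGB), z-score normalize,
    and optionally reduce to n_pca dimensions with PCA. The resulting Z
    is then fed to a single ELM-E classifier."""
    X = np.hstack(descriptors)                       # (N, sum of dims)
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12   # "z-score" statistics
    Z = (X - mu) / sd
    P = None
    if n_pca is not None:
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # principal axes
        P = Vt[:n_pca].T
        Z = Z @ P
    return Z, (mu, sd, P)    # reuse (mu, sd, P) to transform test data
```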


Fig. 3. Classification accuracy of the H-ELM-E on the Landuse21 dataset, depending on the number of hidden neurons (L) per ELM and the number of ELMs (K) in the ensemble ELM-E.

Fig. 4. Classification accuracy of the H-ELM-E on the 8-scenes dataset, depending on the number of hidden neurons (L) per ELM and the number of ELMs (K) in the ensemble ELM-E.

Next, the accuracy of the proposed H-ELM-E method is compared to relevant results published in the literature. Additional tests measuring the influence of the training sample size on the accuracy are also included in Tables 2 and 3. For the Landuse21 dataset, the training size is varied over 80, 50 and 20 images per class, while for the 8-scenes dataset, the training sample size is varied over 200, 100 and 50 samples per class. Tables 1–3 show the generally high accuracy of our method (H-ELM-E).

Finally, we measured and compared the time performance of the proposed method, implemented in MATLAB, on a mainstream Intel Core i7 3.2 GHz. We measured the total training time for 1680 training images and the total testing time for 420 images of the Landuse21 dataset. Table 4 gives the time efficiency results of the proposed late fusion method (H-ELM-E), as well as of an early fusion method (ELM-E). The results in Table 4 demonstrate the high efficiency of H-ELM-E, which is highly stable with respect to the number of hidden neurons and the parameter C = 0.1, as long as the number of hidden neurons is large enough.

Table 2. Comparing the accuracy of the proposed method (H-ELM-E) to state-of-the-art results on the Landuse21 dataset, depending on the number of training images per class.

| Method | 80 | 50 | 20 |
| --- | --- | --- | --- |
| SPM [31] | 74.00 | – | – |
| SPCK++ [49] | 77.38 | – | – |
| MCBGP [14] | 86.52 | 82.11 | 70.29 |
| MCMI [39] | 88.20 | – | – |
| mCENTRIST [41] | 89.90 | – | – |
| H-ELM-E (L = 1000, K = 10) | 90.54 | 86.76 | 75.93 |
| H-ELM-E (L = 2000, K = 10) | 91.45 | 86.81 | 76.15 |

Table 3. Comparing the accuracy of the proposed method (H-ELM-E) to state-of-the-art results on the 8-scenes dataset, depending on the number of training images per class.

| Method | 200 | 100 | 50 |
| --- | --- | --- | --- |
| LCVBP [32] | – | 76.00 | – |
| PM [28] | – | 82.00 | – |
| MCBGP [14] | 84.87 | 82.31 | 78.97 |
| CENTRIST [48] | – | 86.22 | – |
| H-ELM-E (L = 1000, K = 10) | 87.89 | 85.63 | 82.53 |
| H-ELM-E (L = 2000, K = 10) | 87.92 | 86.62 | 82.64 |

Table 4. Time efficiency of ELM-E based methods on the Landuse21 dataset, depending on parameters.

| Classifier | Parameters | Training time (s) | Testing time (s) |
| --- | --- | --- | --- |
| ELM-E (early fusion) | L = 1000, K = 10 | 1.3 | 0.12 |
| H-ELM-E (late fusion) | L = 1000, K = 10 | 5.6 | 0.43 |
| H-ELM-E (late fusion) | L = 2000, K = 10 | 17.3 | 0.91 |
| H-ELM-E (late fusion) | L = 2000, K = 20 | 36.2 | 1.68 |

Hence, there is no need for an exhaustive search for the optimal ELM-E parameters during the training phase, which improves the efficiency of the method. In addition, ELM scales efficiently to a large number of samples [23]. As expected, the "early fusion" method achieves better efficiency than the proposed hierarchical method, at the cost of reduced accuracy; when designing a machine learning system, it is always a matter of compromise between efficiency and accuracy. An additional improvement of the H-ELM-E efficiency can be achieved by using a parallel GPU/CPU implementation [1]. Note that parallelization of an ELM-E ensemble is straightforward [20], since it uses multiple instances of the same ELM architecture.

5. Conclusion

We presented a hierarchical method for visual descriptor fusion which reaches highly accurate results on the scene classification task. The experimental evaluation demonstrated the high effectiveness and efficiency of the method. The high quality of the results is mostly due to three factors: 1) multiple complementary image descriptors can successfully capture a variety of image patterns inside a class; 2) ELM ensembles within the proposed hierarchical feature fusion method demonstrate powerful classification capabilities without time-consuming operations; and 3) late fusion of classification scores by the second-level ELM-E is able to learn the importance of each separate descriptor and predict the final class with high accuracy.

While operating in real time, our method achieves accuracy comparable to state-of-the-art results. Although the tests outlined in this paper were conducted using four descriptors in the visible spectrum, the method is generic enough to be extended to other image descriptors and image modalities, such as Near-InfraRed (NIR) images [4]. Although we demonstrated high-level results using only manually engineered image descriptors, we believe that further improvements could be achieved by using descriptors learned in a supervised manner. Therefore, our primary research plan is to investigate the possibilities for integrating deep learned features [29] into our hierarchical classification method.

References

[1] A. Akusok, K.M. Bjork, Y. Miche, A. Lendasse, High-performance extreme learning machines: a complete toolbox for big data applications, IEEE Access 3 (2015) 1011–1025.
[2] P.K. Atrey, M.A. Hossain, A. El Saddik, M.S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimed. Syst. 16 (6) (2010) 345–379.
[3] B. Ayerdi, M. Grana, Hybrid extreme rotation forest, Neural Networks 52 (2014) 33–42.
[4] A. Bosch, A. Zisserman, X. Muñoz, Scene classification using a hybrid generative/discriminative approach, IEEE Trans. Pattern Anal. Mach. Intell. 30 (4) (2008) 712–727.
[5] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[6] G. Brown, J. Wyatt, R. Harris, X. Yao, Diversity creation methods: a survey and categorisation, Inf. Fusion 6 (1) (2005) 5–20.
[7] M. Brown, S. Susstrunk, Multi-spectral SIFT for scene category recognition, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 177–184.
[8] J. Cao, T. Chen, J. Fan, Fast online learning algorithm for landmark recognition based on BoW framework, in: 2014 IEEE 9th Conference on Industrial Electronics and Applications (ICIEA), 2014, pp. 1163–1168.
[9] J. Cao, Z. Lin, G.-B. Huang, N. Liu, Voting based extreme learning machine, Inf. Sci. 185 (1) (2012) 66–77.
[10] J. Cao, C. Tao, F. Jiayuan, Landmark recognition with compact BoW histogram and ensemble ELM, Multimed. Tools Appl. (2015) 1–19.
[11] J. Cao, L. Xiong, Protein sequence classification with improved extreme learning machine algorithms, BioMed Res. Int. (2014) 1–12.
[12] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27:1–27:27.
[13] M. Crosier, L.D. Griffin, Using basic image features for texture classification, Int. J. Comput. Vision 88 (3) (2010) 447–460.
[14] S. Cvetković, M.B. Stojanović, S.V. Nikolić, Multi-channel descriptors and ensemble of Extreme Learning Machines for classification of remote sensing images, Signal Process. Image Commun. 39 (Part A) (2015) 111–120.
[15] P. Du, A. Samat, P. Gamba, X. Xie, Polarimetric SAR image classification by boosted multiple-kernel extreme learning machines with polarimetric and spatial features, Int. J. Remote Sensing 35 (23) (2014) 7978–7990.
[16] H. Eidenberger, Statistical analysis of content-based MPEG-7 descriptors for image retrieval, Multimedia Syst. 10 (2) (2004) 84–97.
[17] H. Ghassemian, A review of remote sensing image fusion methods, Inf. Fusion 32 (Part A) (2016) 75–89.
[18] P.M. Granitto, P.F. Verdes, H.A. Ceccatto, Neural network ensembles: evaluation of aggregation algorithms, Artif. Intell. 163 (2) (2005) 139–162.
[19] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, GPU-accelerated and parallelized ELM ensembles for large-scale regression, Neurocomputing 74 (16) (2011) 2430–2437.
[20] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015) 32–48.
[21] G.-B. Huang, What are extreme learning machines? Filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle, Cognit. Comput. 7 (3) (2015) 263–278.
[22] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892.
[23] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. B 42 (2) (2012) 513–529.
[24] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1-3) (2006) 489–501.
[25] S. ul Hussain, B. Triggs, Visual recognition using local quantized patterns, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 2012, pp. 716–729.
[26] Y. Kaya, L. Kayci, R. Tekin, O.F. Ertugrul, Evaluation of texture features for automatic detecting butterfly species using extreme learning machine, J. Exp. Theor. Artif. Intell. 26 (2) (2014) 267–281.
[27] F.S. Khan, R.M. Anwer, J. van de Weijer, M. Felsberg, J. Laaksonen, Compact color-texture description for texture classification, Pattern Recognit. Lett. 51 (2015) 16–22.
[28] F.S. Khan, J. van de Weijer, S. Ali, M. Felsberg, Evaluating the impact of color on texture recognition, in: R. Wilson, E. Hancock, A. Bors, W. Smith (Eds.), Computer Analysis of Images and Patterns: 15th International Conference, CAIP 2013, York, UK, August 27-29, 2013, Proceedings, Part I, Springer Berlin Heidelberg, 2013, pp. 154–162.
[29] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., pp. 1097–1105.
[30] Y. Lan, Y.C. Soh, G.-B. Huang, Ensemble of online sequential extreme learning machine, Neurocomputing 72 (13-15) (2009) 3391–3395.
[31] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2006, pp. 2169–2178.
[32] S.H. Lee, J.Y. Choi, Y.M. Ro, K.N. Plataniotis, Local color vector binary patterns from multichannel face images for face recognition, IEEE Trans. Image Process. 21 (4) (2012) 2347–2353.
[33] W. Li, C. Chen, H. Su, Q. Du, Local binary patterns and extreme learning machine for hyperspectral imagery classification, IEEE Trans. Geosci. Remote Sensing 53 (7) (2015) 3681–3693.
[34] N. Liu, H. Wang, Ensemble based extreme learning machine, IEEE Signal Process. Lett. 17 (8) (2010) 754–757.
[35] B.S. Manjunath, J.-R. Ohm, V.V. Vasudevan, A. Yamada, Color and texture descriptors, IEEE Trans. Circuits Syst. Video Technol. 11 (6) (2001) 703–715.
[36] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognit. 29 (1) (1996) 51–59.
[37] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[38] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision 42 (3) (2001) 145–175.
[39] J. Ren, X. Jiang, J. Yuan, Learning LBP structure by maximizing the conditional mutual information, Pattern Recognit. 48 (10) (2015) 3180–3190.
[40] R. Rifkin, A. Klautau, In defense of one-vs-all classification, J. Mach. Learn. Res. 5 (2004) 101–141.
[41] P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface.
[42] A. Samat, P. Du, S. Liu, J. Li, L. Cheng, E²LMs: ensemble extreme learning machines for hyperspectral image classification, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sensing 7 (4) (2014) 1060–1069.
[43] K.E.A. van de Sande, T. Gevers, C.G.M. Snoek, Evaluating color descriptors for object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1582–1596.
[44] C.G.M. Snoek, M. Worring, A.W.M. Smeulders, Early versus late fusion in semantic video analysis, in: Proceedings of the 13th Annual ACM International Conference on Multimedia, Singapore, 2005, pp. 399–402.
[45] X. Tian, L. Jiao, X. Liu, X. Zhang, Feature integration of EODH and Color-SIFT: application to image retrieval based on codebook, Signal Process. 29 (4) (2014) 530–545.
[46] X.-L. Wang, Y.-Y. Chen, H. Zhao, B.-L. Lu, Parallelized extreme learning machine ensemble based on min-max modular network, Neurocomputing 128 (2014) 31–41.
[47] M. Wozniak, M. Grana, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.
[48] J. Wu, J.M. Rehg, CENTRIST: a visual descriptor for scene categorization, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1489–1501.
[49] Y. Yang, S. Newsam, Spatial pyramid co-occurrence for image classification, in: Computer Vision (ICCV), 2011 IEEE International Conference on, 2011, pp. 1465–1472.
[50] L. Zhang, Z. Zhou, H. Li, Binary Gabor pattern: an efficient and robust descriptor for texture classification, in: Image Processing (ICIP), 2012 19th IEEE International Conference on, 2012, pp. 81–84.
[51] Y. Zhao, Theories and applications of LBP: a survey, in: D.-S. Huang, Y. Gan, P. Gupta, M.M. Gromiha (Eds.), Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence, Springer Berlin Heidelberg, pp. 112–120.