Geometrically invariant color, shape and texture features for object recognition using multiple kernel learning classification approach

Information Sciences 484 (2019) 135–152
Chandan Singh∗, Jaspreet Singh
Department of Computer Science, Punjabi University, Patiala 147002, India
∗Corresponding author. E-mail addresses: [email protected] (C. Singh), [email protected] (J. Singh).
https://doi.org/10.1016/j.ins.2019.01.058

Article history: Received 5 July 2018; Revised 10 December 2018; Accepted 24 January 2019; Available online 25 January 2019

Keywords: Texture descriptor; Rotation invariant; Multi-channel Zernike moments; Multiple kernel learning

Abstract

The geometrically invariant image descriptors are very important for recognizing objects of arbitrary shapes and orientations. We propose a framework for the fusion of the geometrically invariant descriptors representing color, shape, and texture for the recognition of color objects using the multiple kernel learning (MKL) approach. To describe the texture of color images, we propose an effective rotation invariant texture descriptor which is based on the Zernike moments (ZMs) of the gradient of the color images, referred to as the GZMs. For the shape features, we use the ZMs of the intensity component of a color image and also use multi-channel ZMs (MZMs), which have proven to be superior in performance to the quaternion ZMs (QZMs). For comparative performance analysis, rotation invariants of the QZMs (RQZMs) are also considered. Since the color histograms (CH) are known to be very effective color descriptors, we consider them for representing color. The five sets of features – CH, ZMs, GZMs, MZMs, and RQZMs – are invariant to translation, rotation, and scale. The fusion of color, shape and texture features in different combinations using the MKL approach is shown to provide very high recognition rates on the PASCAL VOC 2005, Soccer, SIMPLIcity, Flower, and Caltech-101 datasets.

1. Introduction

Visual object recognition is one of the key research areas in numerous digital image processing and computer vision applications. The task of visual object recognition is to correctly recognize the class to which a given test object belongs. However, correctly recognizing the class of a given object has been a challenging task due to the high intra-class variability and the distortions in the images caused by various geometric and photometric changes such as rotation, scale, translation, occlusion, illumination, and noise. The challenges mentioned above can be classified into three categories to better discuss their prevailing solutions. In the first category, the challenges lie in dealing with the high intra-class variability. An obvious solution to this issue is to design feature descriptors which are robust to the variations present in the same class. However, not all descriptors have the same discriminative power for all classes present in the dataset. For example, the descriptor should be invariant to change in color for the car class, whereas the color information is essential for distinguishing the horse class from the zebra class. Thus, a single feature descriptor is not enough to represent the wide range of object classes in a given dataset. A widely accepted solution to this issue is to combine a diverse and complementary set of low-level


features based on the information describing color, shape, edge, and texture. In the second category, the challenges pertain to geometric and photometric changes. To cope with these issues, it is essential to design the feature descriptors which are invariant to these changes. There are various global feature descriptors which are invariant to geometric and photometric changes and also provide high recognition performance under these variations. The third category of challenges pertains to combining the feature descriptors originating from the different modalities realizing for better performance of the fusion of the low-level features. Most of the existing state-of-the-art fusion methods combine them as a weighted concatenation of the features without considering the interaction of the modalities. This approach usually does not take into account the discriminative aspects of the different types of descriptor. As a result, the fused features do not provide significant improvements as compared with the performance of the individual descriptors. In general, the image feature descriptors are divided into two categories - global and local. The global feature descriptors represent the holistic view of an image and most of the approaches provide global descriptors which are robust to geometric and photometric changes, and also are robust to noise. However, they are susceptible to partial occlusion and cluttered background. On the other side, the local feature descriptors represent local regions around given key-points. Unlike the global descriptors, the local feature descriptors are robust to partial occlusion and cluttered background. However, they are highly prone to noise even under mild condition. Recently, deep learning based approaches have become very attractive in numerous computer vision applications due to their better performance as compared to the approaches based on hand-crafted features. Basically, deep learning is a sub-class of a machine learning approach which tries to mimic the functioning of the human brain. Convolutional neural networks (CNNs) are among the most powerful approaches of deep learning methods which are used for image classification. Generally, CNNs consist of multiple layers designed in a hierarchical manner which help to learn the high-level abstractions in the data by using the deep hierarchical architecture. Some of the most powerful CNNs models which have become the baseline models for the application of image classification are AlexNet [16], Clarifai [46], VGG [29], and GoogLeNet [38]. It is well-known that the deep neural network (DNN)-based algorithms achieve significantly better accuracy as compared with the shallow learning techniques. However, due to the deep architecture of the DNN, these algorithms consist of millions of parameters which are learned during the training phase. A major issue with DNN-based algorithms is the requirement of a large amount of training data for their high performance because the DNN-based algorithms have to learn millions of parameters for its robustness and generalization. For this purpose, these approaches need a large amount of training data and high computing power. Although the various state-of-the-art CNNs models are invariant to translation and scale, they are highly sensitive to rotation [14]. A most commonly used solution to resolve the problem of rotation is to augment training data by adding rotated images. However, data augmentation is not always a feasible solution for large datasets. 
Moreover, by adding more data to the existing training dataset, the size of the data is increased and the system needs retraining on the augmented dataset. Keeping these issues in view, hand-crafted features are still in demand in many applications where the training data is limited, or in situations where many variations, such as rotation and noise, are likely to occur in the test data. Therefore, the purpose of the proposed work is to develop hand-crafted features that can achieve high recognition rates for color images using the multiple kernel learning (MKL) classification technique. The proposed global descriptors, based on orthogonal rotation invariant moments (ORIMs), represent shape, texture, and color; they are invariant to geometric and photometric changes and are also robust to noise. We resort to the MKL technique to effectively fuse the features and extract the maximum benefit from the discriminative powers of the descriptors for the classification task.

The rest of the paper is organized as follows. In Section 2, related work is introduced. The existing ORIMs-based approaches are discussed in Section 3, where the ZMs are considered as the representative of the whole class of ORIMs. Since the computational framework of all ORIMs is similar and their performance is almost the same, taking a particular case does not affect the generality of the discussion. The proposed method is explained in Section 4, which includes a discussion on the use of the MKL approach for classification. Section 5 illustrates the experimental analysis, followed by conclusions in Section 6.

2. Related work

Among the various global descriptors, moments are the most prominent global shape descriptors, widely used for various image processing and computer vision applications such as image classification, object recognition, image retrieval, super-resolution, denoising, segmentation, watermarking, etc. Broadly, moments are divided into two categories: orthogonal and non-orthogonal. Between these two, orthogonal rotation invariant moments (ORIMs) are used most commonly due to several useful characteristics such as minimum information redundancy and invariance to rotation. Further, they can be made invariant to translation and scale after some geometric transformation. Several moments, such as the Zernike moments (ZMs) [6], pseudo-ZMs (PZMs) [7], orthogonal Fourier-Mellin moments (OFMMs) [47], and Bessel-Fourier moments (BFMs) [45], which belong to the class of ORIMs, have been successfully applied to image classification and object recognition tasks. Various invariants of ORIMs have been proposed in the literature to deal with the problems of rotation, scale, and translation, and the effectiveness of these invariants has been investigated. Initially, these moments were proposed for gray-scale images. Nowadays, color images are used in many applications because color carries additional information beyond shape and texture and can be used as an important attribute to enhance the discriminative power of the descriptors. This has been possible due to the rapid advancement in technology in producing low-cost computer hardware and also due to cheaper acquisition, transmission and storage capabilities. The advent of low-cost and highly powerful


smartphones and mobile devices has led to a manifold increase in the use of color images. The problem of object recognition and image classification is one of the areas where color images are replacing gray-scale images. To incorporate the color cue in moment-based descriptors, quaternion algebra has recently been used and quaternion moments have been proposed. The purpose behind introducing quaternion moments is to represent both shape and color features at the feature extraction stage. To achieve this objective, Chen et al. [4] have proposed quaternion ZMs (QZMs) and their invariants for color image recognition and template matching. Guo et al. [11] have proposed the quaternion complex moments (QCMs), which are defined in the Cartesian coordinate system. Shao et al. [28] have introduced the quaternion Bessel-Fourier moments (QBFMs) and their invariants based on the phase information of a color image. Xiang-yang et al. [44] have proposed the quaternion radial harmonic Fourier moments and their invariants for color image retrieval. Chen et al. [5] have extended the quaternion theory of moments to rotational moments, radial moments, OFMMs, and PZMs, and have also proposed their invariants to rotation, scale, and translation. Karakasis et al. [15] have developed a unified approach for various quaternion moments and derived invariants for polar, radial, and discrete moments. Singh and Singh [32] have proposed the quaternion generalized Chebyshev–Fourier and pseudo–Jacobi–Fourier moments and also investigated the effect of the free parameter, which accounts for their generalized behavior, on color image reconstruction and object recognition. Wang et al. [41] have developed quaternion polar harmonic Fourier moments for color images and shown their high performance under noisy conditions. Recently, the multi-channel moments have been proposed, which are computed by concatenating the moments of the red, green and blue channels of a color image [31]. It is shown that the multi-channel moments perform significantly better than the quaternion moments in color object recognition problems, and they are faster as well. The quaternion moments were developed with the view that they would provide inter-channel dependency and thereby fuse shape and color for better performance. However, their performance was found to be lower than that of the multi-channel moments. The basic purpose of the development of the quaternion orthogonal rotation invariant moments (QORIMs) and multi-channel orthogonal rotation invariant moments (MORIMs) is to take advantage of both shape and color information in a holistic way. It is a well-known fact that multiple cues such as shape, color, and texture significantly improve the object recognition performance in comparison with a single cue. There are a number of approaches which represent both shape and color information. Prominent among them is the SIFT descriptor. The SIFT descriptor [24], which was originally developed for gray images, has been extended to color SIFT (C-SIFT) [39] to embed shape and color. Some of the highly used variants of the SIFT descriptor for color images are HSV-SIFT, HUE-SIFT, OpponentSIFT, rgSIFT, and RGB-SIFT. Bay et al. [1] have proposed the Speeded Up Robust Feature (SURF) descriptor, which is computationally fast and provides better classification accuracy than SIFT. Ojala et al. [26] developed the local binary pattern (LBP), which is a very powerful local texture descriptor.
The LBP operator has also been successfully applied to various computer vision and pattern recognition applications. In addition to being very effective texture descriptor, LBP is simple and fast to compute. Consequently, it motivated several researchers to further explore numerous variants of LBP for different computer vision applications. Some of the most commonly used variants of LBP are multispectral LBP (MSLBP) [25], local color vector binary patterns (LCVBPs) [19], orthogonal combination of local binary patterns (OC-LBP) [48], quaternion local ranking binary patterns (QLRBP) [17], local similarity pattern (CLSP) [20], multichannel adder and decoder-based LBPs (MDLBPs) [8], and local binary patterns for color images (LBPC) [34]. Srivastava and Khare [37] have proposed an approach for image retrieval which integrates LBP with Legendre moments at multiple resolutions of the wavelet decomposition of an image. Liu et al. [23] proposed a new technique for the fusion of color histogram and LBP features for the application of color image retrieval and image classification. Piras and Giacinto [27] have provided a survey of various information fusion techniques for content-based image retrieval which analyses various fusion methods employing shape, texture, and color. The descriptors mentioned above are very effective when the images do not undergo global geometric transformation such as the rotation which is a common phenomenon in many real applications. For dealing with the problem of rotation and other global geometric transformations, the orthogonal rotation invariant moments (ORIMs) are very effective. Not only do they deal efficiently with rotation, but they are also robust to image noise because they are obtained as an integration process. Since the ORIMs were originally developed for the gray images, they represent the shape information. Recently, quaternion ORIMs (QORIMs) [4,5,11,15,28,32,41,44] and multi-channel ORIMs (MORIMs) [31] have been developed to deal with the shape and color information in a unified way. The MORIMs have proven to be better descriptors than their quaternion counterparts in representing shape and color. In this paper, we develop a framework to derive and combine features from the three different modalities which represent object color, shape and texture. The multiple kernel learning (MKL) method is used to combine the three feature cues to enhance the object recognition and classification performance. The MKL is one of the most popular methods used in computer vision to linearly combine the similarity functions between images to yield the improved classification performance. We focus on the use of the global descriptors which are invariant to global changes in the image such as the rotation, translation and scale. Also, many global descriptors are robust to image noise and occlusion. They are very efficient with regard to feature extraction and provide low-dimensional feature vectors. Further, their derivation is straightforward. Among the various global shape descriptors, the ZMs are one of the most popular shape descriptors which have been adopted by MPEG7 [13] for image retrieval task. The ZMs were originally developed for the gray images. Recently, the quaternion ZMs (QZMs) [4] have been developed for the invariant representation of shape and color features. However, it is shown in [31] that the multi-channel ZMs (MZMs), which are the ZMs of the component images of a color image, are more effective than the QZMs. 
Thus, in this paper, for the shape representation of color objects, we use the MZMs features. For the performance


comparison, we use the rotation invariant features of the QZMs, referred to as RQZMs. It is important to note that the ZMs or the MZMs provide low-level shape information. To represent high-level shape information, we use the gradient of a color image, which yields the texture of the image. For this purpose, we derive the gradient magnitude image of a color image and then derive the ZMs of the gradient image. It is pertinent to mention here that both the gradient magnitude image and the magnitudes of its ZMs are invariant to image rotation. Therefore, the high-level shape information is represented by the ZMs of the gradient magnitude image, referred to as the GZMs. The color information is represented by the color histogram (CH), which provides very simple but effective color information that is invariant to image rotation [12,39]. Scale invariance is achieved when the CH features are normalized by the size of the image. Finally, the MKL approach is used for the classification task.

3. ORIMs-based existing shape and color descriptors

3.1. Shape descriptors

Among the various ORIMs-based shape descriptors, the ZMs are the earliest and most widely used shape descriptors for gray-scale images. Recently, several of their variants and variants of other ORIMs have been developed for representing the shape and color features of color images. In fact, all ORIMs have similar kernel functions and possess similar properties; they differ marginally in their performance, and their computational framework is the same. Therefore, we select the ZMs and their color-variant descriptors as the representative of all ORIMs, as they are widely used ORIMs and a standard global descriptor recommended by MPEG-7 [13].

3.1.1. Zernike moments (ZMs)

The ZMs are defined on the unit disk in the polar domain. Let f(r, θ) be an image function; then the ZMs of order p and repetition q on the unit disk are defined as follows [33]:

$$ZM_{p,q}(f) = \frac{p+1}{\pi}\int_{0}^{2\pi}\!\!\int_{0}^{1} f(r,\theta)\,R_{p,q}(r)\,e^{-jq\theta}\,r\,dr\,d\theta, \qquad (1)$$

where $p \in \mathbb{Z}^{+}$, $q \in \mathbb{Z}$, $p-|q|$ is even, $j=\sqrt{-1}$, and $R_{p,q}(r)$ is the radial basis function defined as follows:

$$R_{p,q}(r) = \sum_{k=0}^{(p-|q|)/2} \frac{(-1)^{k}\,(p-k)!}{k!\left(\frac{p+|q|}{2}-k\right)!\left(\frac{p-|q|}{2}-k\right)!}\; r^{p-2k}. \qquad (2)$$

The set of ZMs are orthogonal and complete. Thus, the function f(r, θ ) can be reconstructed as follows:

$$\hat{f}(r,\theta) = \sum_{p=0}^{p_{\max}} \sum_{\substack{q=-p \\ p-|q|=\mathrm{even}}}^{p} ZM_{p,q}(f)\,R_{p,q}(r)\,e^{jq\theta}, \qquad (3)$$

where $p_{\max}$ is the maximum order of moments. The higher the value of $p_{\max}$, the closer the function $\hat{f}(r,\theta)$ is to $f(r,\theta)$. In digital image processing, the image function f(s, t) of size M × N is a discrete function, where $(s,t) \in [0, M-1]\times[0, N-1]$, defined in the Cartesian coordinate system. The polar parameters (r, θ) of Eq. (1) at a pixel location (s, t) are computed as $r_{st} = \sqrt{x_s^2 + y_t^2}$ ($x_s$ and $y_t$ are defined below) and $\theta_{st} = \tan^{-1}(y_t/x_s)$, where $\theta_{st} \in [0, 2\pi]$. The condition $x_s^2 + y_t^2 \le 1$ is imposed to restrict the computation to the unit disk. The following transformation is used to map a discrete image of size M × N into the unit disk [36,43]:

$$x_s = \frac{2s+1-M}{D}, \quad y_t = \frac{2t+1-N}{D}, \quad s=0,1,2,\ldots,M-1;\; t=0,1,2,\ldots,N-1, \qquad (4)$$

and

$$D = \begin{cases} \min(M, N), & \text{for the inner unit disk},\\ \sqrt{M^2+N^2}, & \text{for the outer unit disk}, \end{cases}$$

where $x_s$ and $y_t$ map the coordinates of a pixel (s, t) onto the unit disk, and the elemental area occupied by each pixel is $[x_s - \frac{\Delta x}{2}, x_s + \frac{\Delta x}{2}] \times [y_t - \frac{\Delta y}{2}, y_t + \frac{\Delta y}{2}]$, with $\Delta x = \Delta y = \frac{2}{D}$. It may be observed that a digital coordinate (s, t) of the image region $[0, M-1]\times[0, N-1]$ is mapped to a location $(x_s, y_t)$ on the unit disk by translating the origin (0, 0) to the center of the image (M/2, N/2) and then scaling the resulting values by the scaling factor $\lambda = \frac{2}{D}$. The resulting coordinate $(x_s, y_t) \in [-1, 1]\times[-1, 1]$, and the condition $x_s^2 + y_t^2 \le 1$ ensures that the computations are performed on the unit disk, whether it is the inscribed disk or the outer disk. It may be noted that a pixel on the unit disk is assumed to lie at the center of the elemental area. There exists an exact solution to derive the ZMs of an image function f(s, t) using geometric moments, which is obtained by performing exact integration of the geometric moments over all pixel areas and then summing up the moments [36,43].


However, this process is computationally slow and numerically unstable for high orders of moments. Therefore, a zeroth-order approximation to Eq. (1) is obtained as follows:

$$ZM_{p,q}(f) = \frac{4(p+1)}{\pi D^2} \sum_{s=0}^{M-1}\sum_{\substack{t=0 \\ x_s^2+y_t^2\le 1}}^{N-1} f(s,t)\,R_{p,q}(r_{st})\,e^{-jq\theta_{st}}. \qquad (5)$$

The following equation is used for image reconstruction:

$$\hat{f}(s,t) = \sum_{p=0}^{p_{\max}} \sum_{\substack{q=-p \\ p-|q|=\mathrm{even}}}^{p} ZM_{p,q}(f)\,R_{p,q}(r_{st})\,e^{jq\theta_{st}}, \quad s=0,1,\ldots,M-1;\; t=0,1,\ldots,N-1. \qquad (6)$$
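To make the computation concrete, the following sketch (not part of the original paper; it assumes NumPy and uses illustrative function names such as radial_poly and zernike_moment) implements the radial polynomial of Eq. (2), the inner-disk mapping of Eq. (4), and the zeroth-order approximation of Eq. (5) for a gray-scale image:

```python
import numpy as np
from math import factorial

def radial_poly(p, q, r):
    """Radial basis function R_{p,q}(r) of Eq. (2)."""
    q = abs(q)
    R = np.zeros_like(r)
    for k in range((p - q) // 2 + 1):
        c = ((-1) ** k * factorial(p - k) /
             (factorial(k) * factorial((p + q) // 2 - k) * factorial((p - q) // 2 - k)))
        R += c * r ** (p - 2 * k)
    return R

def zernike_moment(f, p, q):
    """Zeroth-order approximation of ZM_{p,q}(f), Eq. (5), over the inner unit disk."""
    M, N = f.shape
    D = min(M, N)                                    # inner unit disk, Eq. (4)
    xs = (2 * np.arange(M) + 1 - M) / D              # x-coordinates of pixel centres
    yt = (2 * np.arange(N) + 1 - N) / D              # y-coordinates of pixel centres
    x, y = np.meshgrid(xs, yt, indexing='ij')
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    mask = r <= 1.0                                  # keep only pixels inside the disk
    kernel = radial_poly(p, q, r) * np.exp(-1j * q * theta)
    return (4 * (p + 1) / (np.pi * D ** 2)) * np.sum(f[mask] * kernel[mask])

# |ZM_{p,q}| of a random test image; the magnitude is the rotation-invariant feature
img = np.random.rand(64, 64)
print(abs(zernike_moment(img, 4, 2)))
```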

Invariance of ZMs: The ZMs features are translation invariant if the center of the unit disk is placed at the center of the mass of the image. They are made scale invariant by performing the image-to-unit disk mapping as given by Eq. (4). To achieve rotation invariance, there are three approaches. In the first approach, which is used more commonly, the magnitude of the ZMs are taken as the rotation invariants [31]. In the second approach [6,21], the relationship between the ZMs of an image (training image) and ZMs of its rotated version (test image) are used to estimate the rotation angle between them and the estimated rotation angle is used to correct the ZMs of the rotated image to align its ZMs with the ZMs of the non-rotated image. This procedure, called phase normalization, allows to compare two complex numbers and yields approximately double the number of invariants as compared to the magnitude descriptors. The third approach [4,15] also performs the phase normalization to eliminate the effect of rotation, but it does not estimate the rotation angle. In this method, new rotation invariants are obtained by taking the product of two ZMs of a different order but with the same repetition. The effect of the rotation angle is cancelled because one of the two moments used for the product has the positive repetition term (+q) and the second has the negative repetition term (-q). By taking the product of two ZMs with these combinations, the effect of the rotation angle is eliminated. The three types of rotation invariants are explained as follows. Let ZMp, q (f) represent the ZMs of the image f(s, t), then the ZMs can be written in the polar form as:

$$ZM_{p,q}(f) = |ZM_{p,q}(f)|\,e^{j\varphi_{p,q}(f)}, \qquad (7)$$

where $|ZM_{p,q}(f)|$ is the magnitude and $\varphi_{p,q}(f)$ is the argument (phase angle) of the complex number $ZM_{p,q}(f)$. If the image f(s, t) is rotated by an angle α and the ZMs of the rotated image are represented by $ZM_{p,q}(f^{\alpha})$, then

$$ZM_{p,q}(f^{\alpha}) = |ZM_{p,q}(f^{\alpha})|\,e^{j\varphi_{p,q}(f^{\alpha})}. \qquad (8)$$

The magnitude and phase of the ZMs of the original image f(s,t) and the rotated image fα (s,t) have the following relationship [6,21,35]

$$ZM_{p,q}(f^{\alpha}) = ZM_{p,q}(f)\,e^{-jq\alpha}, \qquad (9)$$

or, in the polar form:

$$|ZM_{p,q}(f^{\alpha})|\,e^{j\varphi_{p,q}(f^{\alpha})} = |ZM_{p,q}(f)|\,e^{j\varphi_{p,q}(f)}\cdot e^{-jq\alpha}. \qquad (10)$$

This yields

$$|ZM_{p,q}(f^{\alpha})| = |ZM_{p,q}(f)|, \qquad (11)$$

and

$$\varphi_{p,q}(f^{\alpha}) = \varphi_{p,q}(f) - q\alpha. \qquad (12)$$

Therefore, the rotation angle α can be estimated by

$$\alpha = \left(\varphi_{p,q}(f) - \varphi_{p,q}(f^{\alpha})\right)/q. \qquad (13)$$

The first type of invariants considers the magnitude of the ZMs using Eq. (11). The second type of rotation invariants uses Eq. (13) to determine the rotation angle between two images (training and test image). The estimated rotation angle is then used to correct the ZMs of one of the two images to align their real and imaginary parts. If the two images are the same but rotated versions of each other, then the real and imaginary parts of the two complex moments will be the same. However, if they are different, then two complex moments will be different. Although this proposition sounds theoretically correct, the estimated rotation angle is not always accurate due to the digital nature of the image. There are three different approaches to estimate α . The first approach [21] considers p = 3 and q = 1 and determines α using Eq. (13) which yields α = ϕ3,1 ( f ) − ϕ3,1 ( f α ). The second approach uses all adjacent phases and estimates α [6]. In the third approach, an optimization problem is solved [18,30]. Although the third approach provides a better estimate of the rotation angle, it is computationally intensive because it solves an optimization problem.

Table 1. Comparison of recognition rates (%) obtained on the normal and rotated datasets of the SIMPLIcity database for the intensity component I = (R + G + B)/3, using various ZMs-based rotation invariant descriptors and various distance-based classifiers.

                       Normal dataset                                 Rotated dataset (25°)
Distance function      |ZMs|   ZMs-phase normalized   ∅^q_{p,p'} [4,31]   |ZMs|   ZMs-phase normalized   ∅^q_{p,p'} [4,31]
L1-norm                41.20   37.80                  41.20               41.40   36.60                  40.40
L2-norm                43.60   37.60                  39.00               42.00   38.40                  37.60
Extended Canberra      41.40   36.40                  41.80               42.40   32.20                  42.00
χ2                     42.00   35.60                  38.40               42.20   36.40                  39.20
Square-chord           41.60   37.00                  41.60               41.80   36.80                  42.20

The third type of the rotation invariants are determined by canceling the effect of the rotation angle by defining a new invariant [4,15]:

$$\varnothing^{q}_{p,p'}(f^{\alpha}) = ZM_{p,q}(f^{\alpha})\,ZM_{p',-q}(f^{\alpha}) = ZM_{p,q}(f)\,e^{-jq\alpha}\,ZM_{p',-q}(f)\,e^{jq\alpha} = ZM_{p,q}(f)\,ZM_{p',-q}(f). \qquad (14)$$
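A minimal sketch of the three types of rotation invariants follows (illustrative complex values and helper names, not taken from the paper); it assumes the complex ZMs of an image and of its rotated version are already available, e.g., from the sketch after Eq. (6):

```python
import cmath

# Illustrative ZMs of a "training" image; the rotated-image moments are built
# via the relationship of Eq. (9) for alpha = 25 degrees.
alpha_true = cmath.pi * 25 / 180
zm = {(3, 1): 0.42 + 0.17j, (5, 1): -0.08 + 0.31j}
zm_rot = {pq: v * cmath.exp(-1j * pq[1] * alpha_true) for pq, v in zm.items()}

# Type 1: magnitude invariants, Eq. (11)
print(abs(zm[(3, 1)]), abs(zm_rot[(3, 1)]))              # identical magnitudes

# Type 2: estimate alpha from the phase difference, Eq. (13), with p = 3, q = 1,
# then undo the rotation (phase normalization)
p, q = 3, 1
alpha_est = (cmath.phase(zm[(p, q)]) - cmath.phase(zm_rot[(p, q)])) / q
zm_corrected = {pq: v * cmath.exp(1j * pq[1] * alpha_est) for pq, v in zm_rot.items()}
print(zm[(5, 1)], zm_corrected[(5, 1)])                   # real/imaginary parts align

# Type 3: product invariant of Eq. (14); for real images ZM_{p',-q} = conj(ZM_{p',q})
phi = zm[(3, 1)] * zm[(5, 1)].conjugate()
phi_rot = zm_rot[(3, 1)] * zm_rot[(5, 1)].conjugate()
print(phi, phi_rot)                                       # equal: the rotation cancels
```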

As shown in [31], the performance of this approach also is not better than the magnitude-based ZMs descriptors. We have conducted several experiments for the comparative performance analysis and observed that the ZMs magnitude-based invariants |ZMs| perform better than the two other types of invariants. To substantiate our observation, we conduct experiments on the SIMPLIcity dataset [42], which has 1000 images of 10 different categories of objects with 100 images for each category. The details of the database are given in Section 5, Experimental analysis. The training and test datasets consist of 500 images each, with 50 images per class. The experiments are conducted on two sets of test images. The first test dataset is the normal dataset (without rotation). In the second test dataset, each image is rotated by 25°. We use the intensity component I = (R + G + B)/3 of the color images to compute their ZMs for evaluating the rotation invariance performance of the three descriptors: (1) the ZMs magnitude (|ZMs|), (2) the phase-corrected ZMs (ZMs-phase normalized), and (3) the descriptor $\varnothing^{q}_{p,p'}$ given by Eq. (14). The recognition results are given in Table 1 for the various invariants and for the five commonly used distance measures. It is shown that the ZMs magnitude descriptor (|ZMs|) performs much better than the two other descriptors, followed by the descriptor $\varnothing^{q}_{p,p'}$ and lastly by the phase-corrected ZMs descriptor. Therefore, in this work, we use the ZMs magnitude as the invariants.

3.1.2. Multi-channel ZMs (MZMs)

In a recent paper [31], the authors have proposed the multi-channel ZMs (MZMs) of color images to represent shape and color information. The approach is straightforward: it treats each channel of a color image as a gray-scale image and computes its ZMs. The MZMs of a color image f(s, t) = (fR(s, t), fG(s, t), fB(s, t)) are the set of the moments of the component images fR(s, t), fG(s, t), and fB(s, t), which represent the red (R), green (G), and blue (B) channels of the color image, respectively. Thus, we extend Eq. (5) for the computation of the MZMs as follows [31]:

$$MZM_{p,q}(f_c) = \frac{4(p+1)}{\pi D^2} \sum_{s=0}^{M-1}\sum_{\substack{t=0 \\ x_s^2+y_t^2\le 1}}^{N-1} f_c(s,t)\,R_{p,q}(r_{st})\,e^{-jq\theta_{st}}, \quad c \in \{R, G, B\}. \qquad (15)$$

The R, G, and B components of a color image are reconstructed as follows:

$$\hat{f}_c(s,t) = \sum_{p=0}^{p_{\max}} \sum_{\substack{q=-p \\ p-|q|=\mathrm{even}}}^{p} MZM_{p,q}(f_c)\,R_{p,q}(r_{st})\,e^{jq\theta_{st}}, \quad c \in \{R, G, B\},\; s=0,1,\ldots,M-1;\; t=0,1,\ldots,N-1. \qquad (16)$$

The number of the MZMs coefficients computed using Eq. (15) is three times the number of ZMs provided by Eq. (5) for the same order of moments $p = p_{\max}$.

Invariance of MZMs: The MZMs are obtained by concatenating the ZMs of the monochrome images fR(s, t), fG(s, t), and fB(s, t) of a color image f(s, t). Therefore, the invariants of the MZMs are obtained in a way similar to the invariants of the monochrome images. One may refer to [31] for further details.

3.1.3. Quaternion ZMs (QZMs)

In an attempt to fuse shape and color information at the feature extraction stage, Chen et al. [4] have proposed the quaternion theory of moments for color images. Subsequently, several other quaternion moments have been developed [4,5,11,15,28,32,41,44], but their design and performance are more-or-less similar to those of the QZMs [4]. Therefore, we select the QZMs as their representative because ZMs have been used more frequently in many applications than any other moment. Let f(r, θ) be a color image in polar coordinates, which can be represented by a pure quaternion number with zero real part, i.e., $f(r,\theta) = 0 + f_R(r,\theta)\,i + f_G(r,\theta)\,j + f_B(r,\theta)\,k$, where $i^2 = j^2 = k^2 = -1$, $ij = k$, $ji = -k$, $jk = i$, $kj = -i$, $ki = j$, $ik = -j$.


Since the multiplication of two quaternion numbers is not commutative, the right-side QZMs of a color image f(r, θ ) of order p and repetition q on a unit disk is defined as:

$$QZM_{p,q}(f) = \frac{p+1}{\pi}\int_{0}^{2\pi}\!\!\int_{0}^{1} R_{p,q}(r)\,f(r,\theta)\,e^{-\mu q\theta}\,r\,dr\,d\theta, \qquad (17)$$

where $p \in \mathbb{Z}^{+}$, $q \in \mathbb{Z}$, $p - |q|$ is even, and $\mu = (i + j + k)/\sqrt{3}$ is a pure quaternion. Given all $QZM_{p,q}(f)$ up to a maximum order $p_{\max}$, the image function is reconstructed as:

$$\hat{f}(r,\theta) = \sum_{p=0}^{p_{\max}} \sum_{\substack{q=-p \\ p-|q|=\mathrm{even}}}^{p} QZM_{p,q}(f)\,e^{\mu q\theta}\,R_{p,q}(r). \qquad (18)$$

The computational framework of the QZMs is similar to that of the ZMs given in Section 3.1.1. Thus, the zeroth-order approximation of Eq. (17) for a discrete image function f(s, t) is given as:

$$QZM_{p,q}(f) = \frac{4(p+1)}{\pi D^2} \sum_{s=0}^{M-1}\sum_{\substack{t=0 \\ x_s^2+y_t^2\le 1}}^{N-1} R_{p,q}(r_{st})\,f(s,t)\,e^{-\mu q\theta_{st}}. \qquad (19)$$

The relationship between the right-side QZMs and ZMs of component functions fR (s, t), fG (s, t), and fB (s, t) of a color image f(s, t) are given as follows [4]:

$$QZM_{p,q}(f) = \frac{p+1}{\pi}\int_{0}^{2\pi}\!\!\int_{0}^{1} R_{p,q}(r)\left[f_R(r,\theta)\,i + f_G(r,\theta)\,j + f_B(r,\theta)\,k\right]e^{-\mu q\theta}\,r\,dr\,d\theta. \qquad (20)$$

After simplifying Eq. (20), the QZMp, q (f) can be expressed as [4]:

$$QZM_{p,q}(f) = A_{p,q} + i\,B_{p,q} + j\,C_{p,q} + k\,D_{p,q}, \qquad (21)$$

where

$$\begin{aligned}
A_{p,q} &= -\tfrac{1}{\sqrt{3}}\left[\mathrm{Im}(ZM_{p,q}(f_R)) + \mathrm{Im}(ZM_{p,q}(f_G)) + \mathrm{Im}(ZM_{p,q}(f_B))\right],\\
B_{p,q} &= \mathrm{Re}(ZM_{p,q}(f_R)) + \tfrac{1}{\sqrt{3}}\left[\mathrm{Im}(ZM_{p,q}(f_G)) - \mathrm{Im}(ZM_{p,q}(f_B))\right],\\
C_{p,q} &= \mathrm{Re}(ZM_{p,q}(f_G)) + \tfrac{1}{\sqrt{3}}\left[\mathrm{Im}(ZM_{p,q}(f_B)) - \mathrm{Im}(ZM_{p,q}(f_R))\right],\\
D_{p,q} &= \mathrm{Re}(ZM_{p,q}(f_B)) + \tfrac{1}{\sqrt{3}}\left[\mathrm{Im}(ZM_{p,q}(f_R)) - \mathrm{Im}(ZM_{p,q}(f_G))\right].
\end{aligned} \qquad (22)$$

Here, $ZM_{p,q}(f_R)$, $ZM_{p,q}(f_G)$, and $ZM_{p,q}(f_B)$ are the ZMs of the R, G, and B components of a color image computed independently using Eq. (5). It is observed here that the QZMs are expressed as a linear combination of the ZMs of the component images $f_R(x, y)$, $f_G(x, y)$, and $f_B(x, y)$; therefore, the computation of the QZMs involves the computation of the ZMs of the component images as the first step. A similar definition exists for the left-side QZMs. The right-side and left-side QZMs are not independent, as one can be obtained from the other using their conjugates [4].

Invariance of QZMs: The rotation invariants of the QZMs, referred to as RQZMs, are defined by the coefficients $\varnothing^{q}_{p,p'}(f)$, which are derived as [4]:

$$\varnothing^{q}_{p,p'}(f) = QZM_{p,q}(f)\left(QZM_{p',q}(f)\right)^{*}, \qquad (23)$$

where p and p′ are independent order parameters, but the repetition parameter q must satisfy the conditions $|q| \le p$, $|q| \le p'$, $p - |q|$ = even, and $p' - |q|$ = even. The scale invariants $L_{p,q}(f)$, referred to as SQZMs, are defined by [4]:

$$L_{p,q} = \sum_{k=0}^{u}\sum_{t=0}^{k} c_{u,k}\,d_{k,t}\,QZM_{w,q}(f)\,\left|QZM_{0,0}(f)\right|^{-(v+2)}, \qquad (24)$$

where

$$|QZM_{0,0}(f)| = \sqrt{\left(ZM_{0,0}(f_R)\right)^2 + \left(ZM_{0,0}(f_G)\right)^2 + \left(ZM_{0,0}(f_B)\right)^2},$$

$u = (p-q)/2$, $v = q+2k$, $w = q+2t$, $c_{u,k} = (-1)^{u+k}\,\frac{q+2u+1}{\pi}\cdot\frac{(q+u+k)!}{k!\,(u-k)!\,(q+k)!}$, and $d_{k,t} = \frac{\pi\,k!\,(q+k)!}{(k-t)!\,(q+k+t+1)!}$. The translation invariance is achieved by placing the center of the unit disk at the common center of mass of the color image [4].
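The sketch below (an illustration with assumed helper names, not the authors' code) builds the quaternion components of Eqs. (21)–(22) from per-channel ZMs and forms the RQZM product invariant of Eq. (23) with a Hamilton product:

```python
import numpy as np

def qzm_from_channel_zms(zm_r, zm_g, zm_b):
    """Eq. (22): quaternion QZM_{p,q} = (A, B, C, D) from the per-channel ZMs."""
    s = 1.0 / np.sqrt(3.0)
    A = -s * (zm_r.imag + zm_g.imag + zm_b.imag)
    B = zm_r.real + s * (zm_g.imag - zm_b.imag)
    C = zm_g.real + s * (zm_b.imag - zm_r.imag)
    D = zm_b.real + s * (zm_r.imag - zm_g.imag)
    return np.array([A, B, C, D])

def q_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def q_conj(a):
    return np.array([a[0], -a[1], -a[2], -a[3]])

def rqzm_invariant(qzm_pq, qzm_p2q):
    """Eq. (23): rotation invariant QZM_{p,q}(f) * conj(QZM_{p',q}(f))."""
    return q_mul(qzm_pq, q_conj(qzm_p2q))

# per-channel ZMs of orders (3,1) and (5,1), e.g. from the zernike_moment sketch
qzm_31 = qzm_from_channel_zms(0.4 + 0.2j, 0.1 - 0.3j, -0.2 + 0.5j)
qzm_51 = qzm_from_channel_zms(0.3 - 0.1j, 0.2 + 0.4j, 0.1 + 0.1j)
print(rqzm_invariant(qzm_31, qzm_51))   # four real components per invariant
```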


It is noted that the rotation and scale invariants, RQZMs and SQZMs, are quaternion numbers which consist of four components for each coefficient. Thus, the number of coefficients of the RQZMs is $L = \frac{1}{12}(p_{\max}+3)\left(2p_{\max}^{2}+12p_{\max}+10+3(-1)^{p_{\max}}+3\right)$, and the number of SQZMs features is four times the number of magnitude features of the ZMs.

4. The proposed method

The proposed method for geometrically invariant shape, texture and color features of color images is based on the ZMs and color histograms. As is well known, the magnitudes of the coefficients of the ZMs are geometrically invariant under translation, rotation, and scale, robust to image noise, and able to handle occlusion to a large extent; consequently, they are used more frequently as a global descriptor than any other rotation invariant moment. In the proposed method, the ZMs are used to represent shape at two levels - at the intensity level, which provides ZMs features at low level (shape features), and at the gradient (edge) level, which provides ZMs features at high level (texture features). The ZMs of the color image provide the shape features at low level because they are computed directly from the pixel intensities. The ZMs of the color gradient image provide high-level shape features, as these are computed from the gradient of the color image. The normalized color histograms (CH) are one of the most effective geometrically invariant global features [12,39]; hence we use them to describe the color features.

4.1. Low-level shape features

The multi-channel ZMs (MZMs) have proven to be very effective geometrically invariant global shape descriptors. Even though many quaternion representations of color images have been proposed recently to derive moments of color images, it is shown in [31] that the multi-channel approach provides the best image recognition performance compared with the quaternion moments. Let $f_R(x, y)$, $f_G(x, y)$ and $f_B(x, y)$ denote the three component images of a color image f(x, y), i.e., $f(x, y) = (f_R(x, y), f_G(x, y), f_B(x, y))$. Then, the MZMs invariants of order p and repetition q are defined by

$$MZM_{p,q}(f) = \left(|ZM_{p,q}(f_R)|,\; |ZM_{p,q}(f_G)|,\; |ZM_{p,q}(f_B)|\right), \qquad (25)$$

where $ZM_{p,q}(f_c)$, $c \in \{R, G, B\}$, are the ZMs of the component images. The magnitude coefficients of the ZMs of the component images $|ZM_{p,q}(f_R)|$, $|ZM_{p,q}(f_G)|$, and $|ZM_{p,q}(f_B)|$ are concatenated to form a low-level feature set $S_L$:

$$S_L = \bigcup_{c=1}^{3}\left\{|ZM_{p,q}(f_c)|,\; 0 \le p \le p_{\max},\; 0 \le q \le p,\; p-q=\mathrm{even}\right\}, \quad c \in \{R, G, B\}. \qquad (26)$$

The negative repetition terms are not considered, to avoid redundancy in the ZMs feature representation, as $|ZM_{p,q}(f_c)| = |ZM_{p,-q}(f_c)|$.
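A sketch of the construction of $S_L$ follows; the zernike_moment stub is a stand-in for a routine such as the one sketched after Eq. (6) and is only there so the example runs on its own:

```python
import numpy as np

# Stand-in for a real Zernike-moment routine (see the sketch after Eq. (6)).
def zernike_moment(channel, p, q):
    rng = np.random.default_rng(p * 10 + q)
    return complex(rng.standard_normal(), rng.standard_normal())

def low_level_features(rgb_image, p_max=7):
    """Eq. (26): concatenate |ZM_{p,q}| of the R, G, B channels (q >= 0 only)."""
    feats = []
    for c in range(3):                        # R, G, B channels in turn
        channel = rgb_image[..., c]
        for p in range(p_max + 1):
            for q in range(0, p + 1):
                if (p - q) % 2 == 0:          # p - |q| must be even
                    feats.append(abs(zernike_moment(channel, p, q)))
    return np.asarray(feats)

img = np.random.rand(64, 64, 3)
# 3 x 20 = 60 coefficients for p_max = 7, consistent with the |MZMs|
# dimensionality (L = 60) quoted later in Table 3.
print(low_level_features(img, p_max=7).shape)
```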

4.2. High-level shape (texture) features

The high-level shape features are obtained by deriving the ZMs of the gradient image of a color image, referred to as GZMs. Since the gradient of an image provides its texture, the high-level shape features represent the texture of an image. The gradient of a color image is obtained by the Di Zenzo method [10]. Let $f_x(x, y)$ and $f_y(x, y)$ represent the x-gradient and y-gradient images. These two gradient images are derived as

$$f_x(x,y) = \frac{\partial f_R(x,y)}{\partial x}\,i + \frac{\partial f_G(x,y)}{\partial x}\,j + \frac{\partial f_B(x,y)}{\partial x}\,k, \qquad
f_y(x,y) = \frac{\partial f_R(x,y)}{\partial y}\,i + \frac{\partial f_G(x,y)}{\partial y}\,j + \frac{\partial f_B(x,y)}{\partial y}\,k, \qquad (27)$$

where i, j, k represent unit vectors along the R, G, and B axes, respectively. We define the terms $g_{xx}$, $g_{yy}$, and $g_{xy}$ as follows:

$$g_{xx} = f_x(x,y)\cdot f_x(x,y), \quad g_{yy} = f_y(x,y)\cdot f_y(x,y), \quad g_{xy} = f_x(x,y)\cdot f_y(x,y). \qquad (28)$$

Then the direction image θ(x, y) and the gradient magnitude image G(x, y) are given by

$$\theta(x,y) = \frac{1}{2}\tan^{-1}\!\left(\frac{2g_{xy}}{g_{xx}-g_{yy}}\right), \qquad (29)$$

and

$$G(x,y) = \left[\frac{1}{2}\left(g_{xx}+g_{yy}+\left[\left(g_{xx}-g_{yy}\right)^{2}+4g_{xy}^{2}\right]^{1/2}\right)\right]^{1/2}. \qquad (30)$$


Fig. 1. (a) original images f(x, y) of “bicycle”, “player”, “horse”, “flower”, and “motorbike”; (b) gradient images G(x, y) of images in (a); (c) rotated images fα (x, y) of images in (a); (d) gradient images Gα (x, y) derived from images in (c); and (e) rotated versions of images G(x, y) in (b) denoted by Gα (x, y). The boundaries and enclosing circles are not the part of the images.

The gradient magnitude G(x, y) is the maximum in the direction θ(x, y). Its minimum value $G_{\min}(x, y)$ will be in the direction $\theta \pm \frac{\pi}{2}$, which is

$$G_{\min}(x,y) = \left[\frac{1}{2}\left(g_{xx}+g_{yy}-\left[\left(g_{xx}-g_{yy}\right)^{2}+4g_{xy}^{2}\right]^{1/2}\right)\right]^{1/2}. \qquad (31)$$
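The following sketch computes the Di Zenzo gradient magnitude of Eqs. (27)–(30) with Sobel derivatives (SciPy is assumed here; the paper's own implementation is in C++ and may differ in detail):

```python
import numpy as np
from scipy.ndimage import sobel

def color_gradient_magnitude(rgb):
    """Di Zenzo gradient magnitude G(x, y) of an RGB image, Eqs. (27)-(30)."""
    rgb = rgb.astype(float)
    fx = np.stack([sobel(rgb[..., c], axis=0) for c in range(3)], axis=-1)  # x-derivatives
    fy = np.stack([sobel(rgb[..., c], axis=1) for c in range(3)], axis=-1)  # y-derivatives
    gxx = np.sum(fx * fx, axis=-1)                                          # Eq. (28)
    gyy = np.sum(fy * fy, axis=-1)
    gxy = np.sum(fx * fy, axis=-1)
    # Eq. (30): maximum rate of change
    return np.sqrt(0.5 * (gxx + gyy + np.sqrt((gxx - gyy) ** 2 + 4.0 * gxy ** 2)))

# GZMs are simply the ZMs of G(x, y); with a zernike_moment routine as sketched
# after Eq. (6): gzm_pq = zernike_moment(color_gradient_magnitude(img), p, q)
img = np.random.rand(64, 64, 3)
print(color_gradient_magnitude(img).shape)
```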

An important characteristic of the gradient magnitude function G(x, y) is its invariance to rotation. That is, if the image f(x, y) is rotated by an angle α, then the magnitude of the gradient image G(x, y) remains unchanged. This is a very useful property of the gradient image for representing the high-level details of the image f(x, y) under rotation. The gradient image G(x, y) is a gray-scale image which represents the gradient magnitude of the color image at pixel location (x, y). To demonstrate the rotation invariance of G(x, y) empirically, we derive the gradient magnitude images G(x, y) of five color images "bicycle", "player", "horse", "flower", and "motorbike", taken from the PASCAL VOC 2005, Soccer, SIMPLIcity, Flower, and Caltech-101 datasets, respectively. These datasets are later used for conducting the experiments, and their characteristics are discussed in Section 5.1. We rotate these images by an "odd" angle α = 37° (to avoid symmetry) to derive their rotated versions $f^{\alpha}(x, y)$, and the gradient images $G^{\alpha}(x, y)$ of the rotated images are obtained. If, instead, the gradient images G(x, y) are rotated by the angle α to yield the images $G_{\alpha}(x, y)$, then we observe that the images $G^{\alpha}(x, y)$ and $G_{\alpha}(x, y)$ are similar. These phenomena are illustrated in Figs. 1(a)–(e), where we plot the five color images f(x, y) in the first column of Fig. 1, i.e., Fig. 1(a). The gradient images G(x, y) are shown in Fig. 1(b), which have been plotted after mapping the gradient values to the range 0 to 255 to highlight the edges.

Table 2. The moment coefficients $|GZM_{p,q}|$ computed for the gradient images G(x, y), $G^{\alpha}(x, y)$ and $G_{\alpha}(x, y)$ of the five images 'bicycle', 'player', 'horse', 'flower', and 'motorbike'.

Image        Gradient image   |GZM0,0|  |GZM1,1|  |GZM2,0|  |GZM2,2|  |GZM3,1|  |GZM3,3|  |GZM4,0|  |GZM4,2|  |GZM4,4|
bicycle      G(x, y)          10.12     1.854     7.587     4.191     3.161     1.763     3.552     4.611     2.151
             G^α(x, y)        8.792     1.585     6.378     3.655     2.716     1.517     3.164     3.927     1.954
             G_α(x, y)        10.11     1.817     7.612     4.206     3.116     1.732     3.524     4.662     2.113
player       G(x, y)          13.77     1.387     18.07     4.298     2.237     2.066     6.547     5.352     1.599
             G^α(x, y)        13.97     1.361     17.81     4.495     2.214     2.060     6.323     5.306     1.664
             G_α(x, y)        13.77     1.213     18.10     4.344     2.071     1.969     6.366     5.507     1.511
horse        G(x, y)          19.71     1.343     16.79     7.436     2.914     1.340     4.352     5.039     3.825
             G^α(x, y)        17.04     1.169     14.48     6.479     2.543     1.172     3.698     4.437     3.289
             G_α(x, y)        19.71     1.390     16.85     7.412     3.009     1.473     4.213     5.026     3.834
flower       G(x, y)          5.874     0.356     4.339     1.493     0.748     0.351     1.103     0.630     1.737
             G^α(x, y)        5.262     0.307     3.794     1.346     0.682     0.308     1.015     0.564     1.572
             G_α(x, y)        5.871     0.411     4.333     1.488     0.704     0.362     1.147     0.631     1.729
motorbike    G(x, y)          16.62     2.220     19.73     7.335     6.026     1.623     15.91     3.608     0.151
             G^α(x, y)        15.90     2.098     18.36     7.128     5.633     1.546     14.77     3.133     0.091
             G_α(x, y)        16.56     2.502     19.73     7.176     6.433     1.830     15.54     3.493     0.156

The original five images f(x, y) are then rotated by the angle α = 37° as shown in Fig. 1(c), and their gradient images $G^{\alpha}(x, y)$ are obtained, which are depicted in Fig. 1(d). The five gradient images G(x, y) in Fig. 1(b) are rotated by the same angle α to yield the images $G_{\alpha}(x, y)$, which are plotted in Fig. 1(e). The image boundaries and the circles shown in the figures are not parts of the edges; the boundaries are drawn to visualize the image region, and the circles show the regions used for the computation of the ZMs. It is observed that the images $G^{\alpha}(x, y)$ and $G_{\alpha}(x, y)$ are similar.

The gradient magnitude image G(x, y) is computed after deriving the partial derivatives $f_x(x, y)$ and $f_y(x, y)$ using the Sobel x-direction and y-direction gradient operators. The ZMs of G(x, y), referred to as $GZM_{p,q}$, are obtained by using Eq. (5) in which f(x, y) is replaced by G(x, y). To verify the rotation invariance of the GZMs features for the five images, we compute the $|GZM_{p,q}|$ values up to the maximum order $p_{\max} = 4$ and display these values in Table 2. An order-of-magnitude study of the gradient function G(x, y) shows that the maximum value of G(x, y) can be as large as $L\,[24(2+\sqrt{5})]^{1/2}$, where L is the maximum intensity level of all components. We have scaled down the values of G(x, y) by the factor $[24(2+\sqrt{5})]^{1/2}$ and displayed the scaled values in the table. It is observed that the intra-class variations in $|GZM_{p,q}|$ are very small as compared to the inter-class variations. The intra-class variations are due to the discretization errors that occur during rotation and gradient computation.

4.3. Color features

Normalized color histograms (CH) are one of the most effective global color feature descriptors which are invariant to rotation and scale [12,39]. They are derived by dividing the color ranges of the intensity levels of the component images into a number of color bins. Although there are a number of color spaces, such as RGB, HSI, and L∗a∗b, there is not much difference among the performance of the CH features derived from these color spaces [22]. In this paper, we use the RGB color space. We divide the intensity levels of each of the three channels into an equal number of bins and assign a color pixel to the appropriate bin to derive the desired CH features. The color bins are divided by the size of the image to obtain normalized histogram features. In this paper, we have divided the R, G, and B color ranges into 4 × 4 × 4 = 64 bins.
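A minimal sketch of the 64-bin normalized histogram, assuming 8-bit RGB input (illustrative code, not the authors' implementation):

```python
import numpy as np

def color_histogram(rgb, bins_per_channel=4):
    """Normalized joint RGB histogram with 4 x 4 x 4 = 64 bins (Section 4.3)."""
    # quantize each channel of an 8-bit image into `bins_per_channel` equal ranges
    q = (rgb.astype(int) * bins_per_channel) // 256
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3).astype(float)
    return hist / idx.size            # divide by the image size -> scale invariance

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
ch = color_histogram(img)
print(ch.shape, ch.sum())             # (64,) 1.0
```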
4.4. Fusion of shape and texture features

The task of combining discriminative and complementary features is very challenging. Since the shape and color features are obtained using different modalities, their fusion at the feature level by concatenating the feature vectors may not provide high performance when using a distance-based classifier. More often, the fused features provide lower recognition rates than their individual performance. This problem occurs more frequently when the ranges of the feature values are quite different, which is the situation in the proposed system. The range of $|MZM_{p,q}|$ can go up to 255, as the coefficient $|MZM_{0,0}|$ represents the average intensity of an image, which is the largest component in the set of the MZMs features. The range of $|GZM_{p,q}|$ is approximately ten times smaller than that of $|MZM_{p,q}|$, which can be shown by an order-of-magnitude analysis of the gradient function. On the other hand, the normalized CH features lie in the range 0 to 1. One of the solutions to the unequal range problem of the feature vectors is to bring the features of the MZMs and GZMs to the range 0 to 1. But the normalized features of the MZMs and GZMs provide lower performance than their non-normalized forms using a distance-based classifier. This problem arises because the magnitudes of the coefficients of the moments vary significantly. Thus, a concatenation of the feature values originating from different modalities restricts the recognition performance.

In the recent past, kernel methods, which are extensions of the support vector machines (SVM) for combining multiple modalities, have become very popular in object recognition and image classification. Like the SVM, the key aspect of the kernel methods is the introduction of nonlinearity in the decision functions by performing a nonlinear mapping of the


input feature space into a high-dimensional space and then constructing a hyper-plane for making a binary decision about the class of a new feature vector. For this purpose, kernel methods make use of kernel functions which define a measure of similarity between pairs of feature vectors. In the context of the combination of features from different modalities, it is useful to associate a kernel with each modality. This process is called multiple kernel learning (MKL) [9,2]. The MKL has become a potent object classification method when features from different modalities such as shape, color, and texture are combined to provide high classification results. It has been observed to be a very powerful tool for classification in many image processing and computer vision tasks. The MKL combines the kernel functions linearly over the different modalities and optimizes the resulting kernel jointly. Let the number of modalities be F, and let x and x′ be two feature vectors of the same modality, e.g., the feature vectors of a training and a test image. Also, let $k_m(x, x')$ denote a kernel for the m-th modality; then MKL maximizes the fused kernel $k^{*}(x, x') = \sum_{m=1}^{F} \beta_m\,k_m(x, x')$ with respect to the coefficients $\beta_m$, $m = 1, 2, \ldots, F$, and

with respect to the parameters used in the SVM training. In this paper, we use the following kernel function:





$$K_m(x, x') = e^{-\frac{d_m(x,\,x')}{S_m}}, \qquad (32)$$

where $d_m(x, x')$ is a distance function for the m-th modality, and $S_m$ is the average of the pairwise distances of the feature vectors used for the training. The regularization parameter C of the SVM is selected by the cross-validation (CV) approach, taking various values $C \in \{2^{-6}, \ldots, 2^{13}\}$. The regularization parameter p on the kernel weights is set as p = 2. We use the SMO-MKL [40] code available online with these settings of the SVM parameters.
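The sketch below illustrates the kernel of Eq. (32) and the linear fusion of per-modality kernels; the χ2 distance and the uniform weights β_m are placeholders, since in the paper the weights are learned by the SMO-MKL solver:

```python
import numpy as np

def chi2_distance(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative feature vectors."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def modality_kernel(train_feats, test_feats):
    """Kernel of Eq. (32): K_m(x, x') = exp(-d_m(x, x') / S_m), where S_m is the
    mean pairwise distance over the training feature vectors."""
    n = len(train_feats)
    pair = [chi2_distance(train_feats[i], train_feats[j])
            for i in range(n) for j in range(i + 1, n)]
    S_m = np.mean(pair)
    return np.array([[np.exp(-chi2_distance(a, b) / S_m) for b in train_feats]
                     for a in test_feats])

# one kernel per modality (e.g., CH, MZMs, GZMs); uniform beta_m shown only to
# demonstrate the fusion k* = sum_m beta_m K_m
rng = np.random.default_rng(0)
modalities = [rng.random((10, 64)), rng.random((10, 60)), rng.random((10, 60))]
kernels = [modality_kernel(feats, feats) for feats in modalities]
betas = np.ones(len(kernels)) / len(kernels)
K_fused = sum(b * K for b, K in zip(betas, kernels))
print(K_fused.shape)
```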
5. Experimental analysis

For conducting the experimental analysis, we have implemented the various descriptors using Microsoft Visual C++ 8.0 in a Windows 8 environment on a PC with a 1.90 GHz CPU and 4 GB RAM. The classification is performed using the MATLAB SVM [3] and MKL [40] functions. We have not come across any prior studies on the recognition rate of color images using moment-based global geometric invariant descriptors with SVM/MKL. A recent work on the QPHFMs [41] descriptors, which are also rotation invariant, compares their results with the magnitude of the QZMs using a distance-based classifier. The results of the QPHFMs are shown to be better than the QZMs only under a high-noise condition. Therefore, a direct comparison of the recognition rates with similar settings is not possible. However, we have selected the quaternion-moment-based ZMs descriptor (QZMs) [4] and used its invariants, the RQZMs, for the recognition rate using the SVM and MKL classifiers. Also, in view of our previous work [31], in which we had reported better results for the MZMs than the RQZMs, the MZMs are also used for the comparison purpose.

5.1. Datasets and their characteristics

The task of color object recognition is carried out on the PASCAL VOC 2005 [51] (shape predominance), Soccer [52] (color dominance), SIMPLIcity [42] (color dominance and a variety of objects), Flower [50] (color and shape parity), and Caltech-101 [49] (shape and color co-interference) datasets under normal, rotation and scale conditions. These datasets represent shape and color information in different forms of their dominance. We have considered PASCAL VOC 2005 instead of its higher versions because of the speed limit of our PC, as the dataset contains only 1373 images of four classes – car (547), motorbike (430), bicycle (228), and person (168). The Soccer dataset consists of 7 classes of different soccer teams and each class contains 40 images; thus, the size of the dataset is 280. The 7 classes differ in the color of the players' jerseys. The SIMPLIcity dataset has 1000 images, 10 classes and 100 images in each class. It is a color dominant dataset with a variety of objects. The classes are: people, beach, buildings, bus, dinosaur, elephant, flower, horse, glacier and food. The Flower dataset consists of 1360 images which are categorized into 17 different classes representing different species of flowers, with 80 images per class. The Caltech-101 dataset consists of 9144 images classified into 102 object classes and one background class. Each class of the Caltech-101 dataset consists of a variable number of images ranging from 40 to 800. For the experimental purpose, we have randomly chosen 10 of the 102 object classes of the Caltech-101 dataset which have 100 or more images per class and randomly selected 100 images from each class, which results in a dataset of 1000 images. The method for the preparation of the various datasets under different conditions is explained as follows.

Normal dataset: Under the normal condition, the train and test databases contain the original images given in the dataset. The PASCAL VOC 2005 dataset is available in three parts: training (342), validation (342) and test (689) images. For conducting the experiments, the validation images are merged with the training dataset, yielding a size of 684 images for training. In the case of the Soccer, SIMPLIcity, Flower and Caltech-101 datasets, half of the images from each class are randomly taken for training and the other half are used in the test database. A five-fold cross-validation is performed to derive the values of the regularization parameter C for all five databases.

Rotated and scaled datasets: The rotated datasets for each of the five databases are formed from their test datasets by rotating each image by an angle $\alpha \in [0°, 90°]$ selected randomly. Similarly, the scaled datasets are constructed from the test datasets by scaling each image by a scaling factor $\lambda \in \{0.5, 0.75, 1.25, 1.5, 1.75\}$, selected randomly. Figs. 2(a)–(e) show one image of each dataset under the normal, rotation, and scale conditions.
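As an illustration of this dataset preparation (not the authors' code; Pillow is assumed for image I/O), the rotated and scaled test sets can be generated as follows:

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(42)

def make_rotated_and_scaled(test_image_paths):
    """Build rotated and scaled copies of a test set as described in Section 5.1:
    a random angle in [0, 90] degrees and a random scale factor per image."""
    scales = [0.5, 0.75, 1.25, 1.5, 1.75]
    rotated, scaled = [], []
    for path in test_image_paths:
        img = Image.open(path).convert('RGB')
        angle = float(rng.uniform(0.0, 90.0))
        rotated.append(img.rotate(angle, resample=Image.BILINEAR, expand=True))
        lam = float(rng.choice(scales))
        new_size = (max(1, int(img.width * lam)), max(1, int(img.height * lam)))
        scaled.append(img.resize(new_size, resample=Image.BILINEAR))
    return rotated, scaled
```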


Fig. 2. Original, rotated, and scaled images from the datasets: (a) PASCAL VOC 2005, (b) Soccer, (c) SIMPLIcity, (d) Flower, and (e) Caltech-101.

Table 3. Recognition rates (%) obtained by various descriptors using the single kernel learning (SVM) classifier under the normal condition on the five datasets with the χ2 distance function.

Database                                          CHs [39] (L = 64)   |ZMs| (L = 56)   |GZMs| (L = 56)   |MZMs| [31] (L = 60)   RQZMs [4] (L = 160)
PASCAL VOC 2005 (Shape dominance)                 74.60               63.86            69.52             63.43                  62.84
Soccer (Color dominance)                          64.29               25.71            23.57             42.86                  35.86
SIMPLIcity (Color dominance)                      83.00               49.40            51.20             66.00                  55.80
Flower (Color and shape parity)                   59.56               28.24            29.56             44.56                  33.38
Caltech-101 (Shape and color co-interference)     67.40               70.00            70.80             72.00                  70.40

5.2. Performance comparison under various conditions

5.2.1. Normal condition

Under this condition, we consider all five datasets in their normal forms. Out of the various distance functions such as the L1-norm, L2-norm, extended Canberra, χ2 and square-chord, we select the χ2 distance function [48] for the SVM and MKL kernel function, as this provides the best overall results. The single cues are represented by CH (color), ZMs (shape, using the intensity component of a color image), MZMs [31] (early fusion of shape and color), RQZMs [4] (early fusion of shape and color), and GZMs (proposed texture). We consider the MZMs and RQZMs features as single cues, although these features represent both shape and color, because they represent their early fusion at the feature extraction stage. The recognition rates for the single cues are given in Table 3 for the PASCAL VOC 2005 (shape dominance), Soccer (color dominance), SIMPLIcity (color dominance), Flower (color and shape parity), and Caltech-101 (shape and color co-interference) datasets, respectively.

It is observed from Table 3 that for the PASCAL VOC 2005 dataset, the highest recognition rate of 74.60% is achieved by the CH features. Among the shape features (ZMs, MZMs, and RQZMs), the highest recognition rate of 63.86% is achieved by the ZMs features. The proposed texture descriptor GZMs provides the highest recognition rate of 69.52%, which is 5.66% more than the value of 63.86% obtained by the ZMs features, the best among the three shape features. This shows that the proposed geometrically invariant texture features GZMs are very effective for shape-dominated objects. Another important outcome observed here is the high performance of the color features (CH) compared with the shape features (ZMs), the early fusion of color and shape (MZMs and RQZMs), and the texture features (GZMs). The high recognition rates achieved by the color features (CH) signify the superiority of the CH features over the shape and texture features, despite the fact that the database is shape dominated and the color features are straightforward histograms of the color components.

The recognition rates for the color-dominated datasets Soccer and SIMPLIcity are also shown in Table 3. We choose these two datasets for color dominance because the recognition rates achieved on them by the various descriptors are significantly different. While the CH features provide the highest recognition rate of 64.29% on the Soccer dataset, they achieve a much higher recognition rate of 83.00% on the SIMPLIcity dataset. Not only the color but also the shape, texture, and early-fusion descriptors achieve lower recognition rates on the Soccer dataset than on the SIMPLIcity dataset. Comparing the relative performance of the color features with the shape and texture features, the color features provide


Table 4. Recognition rates (%) obtained using the multiple kernel learning approach under the normal condition on the five datasets for the χ2 distance function.

Database                                          Best single cue (ref. Table 3)   CH+ZMs   CH+GZMs   CH+MZMs   CH+ZMs+GZMs   CH+GZMs+MZMs
PASCAL VOC 2005 (Shape dominance)                 74.60 (CH)                       75.76    78.96     76.20     79.04         79.54
Soccer (Color dominance)                          64.29 (CH)                       64.29    66.43     62.14     67.86         67.14
SIMPLIcity (Color dominance)                      83.00 (CH)                       84.80    83.00     84.00     84.20         85.20
Flower (Color and shape parity)                   59.56 (CH)                       64.12    67.50     64.41     66.03         66.32
Caltech-101 (Shape and color co-interference)     72.00 (MZMs)                     79.00    81.40     79.60     84.60         84.60

very high recognition rates. While the CH features provide the highest recognition rate of 64.29% on the Soccer dataset, the shape features (ZMs) provide 25.71%, the texture features (GZMs) 23.57%, and the early fusion of color and shape (MZMs and RQZMs) 42.86% and 35.86%, respectively. As these two datasets are color dominated, the MZMs and RQZMs features, which represent an early fusion of color and shape, are expected to provide better performance than the descriptors ZMs and GZMs, which represent only shape and texture features; this is reflected in the lower recognition rates of the latter. A similar trend of the recognition rate is also observed on the SIMPLIcity dataset. The highest recognition rates for the CH, ZMs, GZMs, MZMs, and RQZMs are 83.00%, 49.40%, 51.20%, 66.00%, and 55.80%, respectively. Although the performance of the MZMs and RQZMs is lower than that of the CH features, they perform much better than the GZMs and ZMs. The performance of the MZMs is better than that of the RQZMs, both of which represent an early fusion of color and shape. On this dataset too, the proposed texture features GZMs perform better than the shape features ZMs, reaffirming the better discriminative power of the GZMs features. When there is a parity of shape and color cues, the highest recognition rate of 59.56% is achieved by the color features (CH) on the Flower dataset. The trend of the recognition rates achieved by the shape (ZMs), texture (GZMs), and early fusion of color and shape (MZMs and RQZMs) on the Flower dataset is almost the same as for the Soccer and SIMPLIcity datasets, which are color dominated. The highest recognition rates achieved by these features are 28.24%, 29.56%, 44.56%, and 33.38%, respectively. Therefore, even if there is a parity of the color and shape cues, the performance of the color is much better than that of the shape. As on the Soccer and SIMPLIcity datasets, which are color dominated and for which the CH features are consequently expected to perform better, the CH features perform better even for a parity between the color and shape features. This is evident not only from the high performance of the CH features as compared to the other features but also from the high performance of the MZMs and RQZMs features compared to the GZMs and ZMs features. The MZMs and RQZMs are the early fusion of the color and shape features; therefore, the embedding of the color cue with the shape cue in the MZMs and RQZMs enhances their discriminative power compared to the GZMs and ZMs, which represent either texture or shape. The performance of the MZMs features is much better than that of the RQZMs features, yielding a difference of 11.18% between their best recognition rates of 44.56% and 33.38%, respectively. Here too, the proposed texture descriptor GZMs performs better than the shape features ZMs. The recognition rates achieved by all cues on the Caltech-101 dataset, which represents shape and color co-interference, are not much different. The highest recognition rate of 72.00% is achieved by the MZMs (early fusion of shape and color), followed by the GZMs (texture) with 70.80%, the RQZMs (early fusion of shape and color) with 70.40%, the ZMs (shape) with 70.00%, and lastly the CH (color) with 67.40%. These results demonstrate that the shape and texture features can be as effective as the color features where there is a co-interference of color and shape. It is also observed that the proposed texture features GZMs provide better results than the shape features ZMs. Also, the MZMs perform better than the RQZMs.
To analyze the discriminative power of the fusion of the various descriptors using the MKL technique, we form five combinations: CH+ZMs, CH+GZMs, CH+MZMs, CH+ZMs+GZMs, and CH+GZMs+MZMs. These combinations represent color and shape (CH+ZMs); color and texture (CH+GZMs); color and an early fusion of color and shape (CH+MZMs); color, shape and texture (CH+ZMs+GZMs); and color, texture and an early fusion of color and shape (CH+GZMs+MZMs). Having established the superiority of the MZMs over the RQZMs, we consider the MZMs and ignore the RQZMs for representing an early fusion of color and shape, as the former provide better recognition rates, as observed in Table 3. The effect of the MKL approach on the recognition rates of the combined cues is studied on the five datasets and the recognition rates are shown in Table 4. The first column of the table displays the best recognition rates achieved by the single cues for each of the distance measures. It is observed that the fusion of multiple cues significantly improves the recognition rates on all datasets except the SIMPLIcity. The maximum recognition rate of 79.54% on the PASCAL VOC 2005 dataset using multiple cues is obtained by CH+GZMs+MZMs, as shown in the table. It is also very interesting to note that the early fusion of color and shape does not provide much improvement over their late fusion, which is evident from the recognition rates yielded by CH+GZMs+MZMs (79.54%) and CH+ZMs+GZMs (79.04%). A similar improvement is also achieved on the other datasets except the SIMPLIcity dataset, on which the improvement is 2.20%. The maximum improvement of 12.60% achieved by the multiple cues over the single cue is observed on the Caltech-101 dataset, which represents shape and color co-interference of the objects. The above observations reveal that when the color cue is dominant, the color features alone provide very high recognition rates. For the objects having shape and color co-interference, and color and shape parity, the improvement in the recognition rates obtained by fusing the color, shape, and texture features is significant. In fact, the combination of the color and the proposed texture features provides much more improvement than the combination of the color and shape features.
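The late fusion studied here can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it builds one exponentiated χ² kernel per cue, normalizes the distances by their average pairwise value, combines the base kernels with uniform weights as a stand-in for the weights learned by the MKL solver (e.g., SMO-MKL [40]), and trains a precomputed-kernel SVM. The synthetic feature matrices, the value of C, and the uniform weights are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(F, G, mean_dist=None):
    """Exponentiated chi-square kernel exp(-d(x, x')/A) between the rows of F and G.

    Distances are normalized by A, the average pairwise distance, as described
    in the time-complexity discussion of Section 5.2.5."""
    D = np.array([[np.sum((x - g) ** 2 / (x + g + 1e-12)) for g in G] for x in F])
    if mean_dist is None:
        mean_dist = D.mean()
    return np.exp(-D / mean_dist), mean_dist

# Hypothetical per-cue training features (rows = images) and labels; in the paper
# these would be the CH, GZMs and MZMs feature vectors of the training images.
rng = np.random.default_rng(0)
cues = {"CH": rng.random((60, 64)), "GZMs": rng.random((60, 112)), "MZMs": rng.random((60, 120))}
y_train = rng.integers(0, 4, size=60)

# One base kernel per cue; uniform cue weights stand in for the learned MKL weights.
betas = {name: 1.0 / len(cues) for name in cues}
K_train = sum(betas[name] * chi2_kernel(X, X)[0] for name, X in cues.items())

clf = SVC(kernel="precomputed", C=10.0)   # C is an assumed value
clf.fit(K_train, y_train)
```

At prediction time the same per-cue kernels are evaluated between the test image and the T training images (a 1 × T kernel vector per cue), combined with the same weights, and passed to the trained classifier.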


Table 5
Recognition rates (%) obtained using the multiple kernel learning approach under the rotation condition on the five datasets for the χ² distance function. Images are rotated by angles taken randomly between 0° and 90°.

Databases                                        Condition   CH+ZMs   CH+GZMs   CH+MZMs   CH+ZMs+GZMs   CH+GZMs+MZMs
PASCAL VOC 2005 (Shape dominance)                Normal      75.76    78.96     76.20     79.04         79.54
                                                 Rotation    75.34    78.25     74.60     78.37         78.66
Soccer (Color dominance)                         Normal      64.29    66.43     62.14     67.86         67.14
                                                 Rotation    62.86    65.00     62.14     65.00         67.14
SIMPLIcity (Color dominance)                     Normal      84.80    83.00     84.00     84.20         85.20
                                                 Rotation    84.60    82.80     84.00     84.00         85.00
Flower (Color and shape parity)                  Normal      64.12    67.50     64.41     66.03         66.32
                                                 Rotation    64.12    66.44     63.38     65.44         66.32
Caltech-101 (Shape and color co-interference)    Normal      79.00    81.40     79.60     84.60         84.60
                                                 Rotation    79.00    81.40     78.20     84.20         84.00

Some important observations are worth mentioning after going through the trends of the recognition rates in Tables 3 and 4. The first important observation is that, among the three cues – color, shape and texture – the color cue is very effective for object recognition, which, in fact, is a universally accepted observation in color object recognition problems. Except for the shape-dominated objects (PASCAL VOC 2005) and the objects with shape and color co-interference (Caltech-101), it performs consistently much better than the shape and texture features. Since all databases (or all objects) may not be color dominated, the fusion of shape and texture features using the MKL classifier provides significant improvements in color object recognition. This observation establishes the relevance of the proposed work because it is robust to color, shape and texture variations. The second noteworthy observation concerns the discriminative power of the proposed GZMs texture features compared to the ZMs-based shape features: the GZMs perform consistently better than the ZMs features across all datasets and across all distance functions.

5.2.2. Rotation

A major objective of the present work is to combine color, shape, and texture features so as to provide invariance to global in-plane rotation. All three descriptors are rotation invariant. It is well known that the color features (CH) and the magnitudes of the ZMs and their color counterparts MZMs and RQZMs are rotation invariant. To strengthen the rotation invariance of the descriptors, we have proposed the ZMs of the magnitude of the gradient image (GZMs) for the texture cue of an image and shown that the magnitude of the GZMs is rotation invariant. We next perform the experiments for the recognition rate under in-plane rotation of the images. For this purpose, we conduct experiments for the five combinations of the cues – CH+ZMs, CH+GZMs, CH+MZMs, CH+ZMs+GZMs, and CH+GZMs+MZMs. The recognition rates for the rotated versions of the five datasets are shown in Table 5. For comparison, the recognition rates on the normal datasets are also given in the table. It is recalled that the angle of rotation for an image is selected randomly between 0° and 90°. Angles between 90° and 360° are not required for testing rotation invariance because of the 8-way symmetry/anti-symmetry property of the ZMs. It is seen in Table 5 that the recognition rates on all the rotated datasets are very close to their normal (non-rotated) values. Of the 25 recognition-rate values for the rotated datasets, all but one are very close to the corresponding values on the normal datasets; overall, the decrease in the recognition rate is less than 1%. The one exception, for which the maximum decrease of 1.40% is observed, pertains to the Caltech-101 dataset for the CH+MZMs features. These results prove the robustness of the proposed descriptors to rotation. Not only the features of the different modalities but also the MKL approach for the classification of the objects proves to be very effective in maintaining high recognition rates under in-plane rotation of the objects.
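To make the experimental protocol concrete, the sketch below shows one way the rotated test copies used in this subsection (and the scaled copies used in the next subsection) can be generated. It is an assumed SciPy-based implementation, not the authors' code; the interpolation order and the fill value for the exposed corners are illustrative choices.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)                 # assumed seed, for reproducibility of the sketch
SCALE_FACTORS = (0.5, 0.75, 1.25, 1.5, 1.75)   # the set used for the scaled datasets

def rotated_copy(image):
    """Rotate an (H, W, 3) image by a random in-plane angle in [0°, 90°].

    Angles beyond 90° are unnecessary because of the 8-way
    symmetry/anti-symmetry property of the Zernike basis."""
    angle = rng.uniform(0.0, 90.0)
    return ndimage.rotate(image, angle, axes=(1, 0), reshape=True, order=1, cval=0.0)

def scaled_copy(image):
    """Resample an (H, W, 3) image by a factor drawn randomly from SCALE_FACTORS."""
    s = rng.choice(SCALE_FACTORS)
    return ndimage.zoom(image, zoom=(s, s, 1.0), order=1)
```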
5.2.3. Scale

As explained in Section 5.1, the five scaled datasets are prepared by scaling images with scaling factors taken randomly from the set {0.5, 0.75, 1.25, 1.5, 1.75}. The results for scaling are shown in Table 6. It is observed that all the recognition rates on the scaled datasets are very close to the recognition rates on the normal datasets. This proves the robustness of the proposed framework to scale; that is, the discriminative powers of the proposed features and their fusion using the MKL approach provide recognition rates as high as those under the normal condition.

5.2.4. Comparison with convolutional neural networks (CNNs)-based methods

In this section, a comparison is performed between the proposed hand-crafted, globally rotation invariant descriptors classified with the MKL approach and CNN-based image classification methods. For this purpose, AlexNet [16] is chosen as the baseline model; it is publicly available and is the winner of the ILSVRC 2012 competition. Table 7 compares the best recognition rates of the proposed descriptors with those of AlexNet for all five datasets under the normal, rotation, and scale conditions.
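The exact training protocol used for the AlexNet baseline is not detailed here; the snippet below is only a minimal sketch of how an ImageNet-pretrained AlexNet can be fine-tuned for one of the datasets, assuming a recent torchvision. The number of classes, epochs, learning rate, and the synthetic stand-in data are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

num_classes = 17                                     # e.g., the Flower dataset (assumed)
model = models.alexnet(weights="IMAGENET1K_V1")      # ImageNet-pretrained baseline [16]
model.classifier[6] = nn.Linear(4096, num_classes)   # replace the 1000-way output layer

# Synthetic stand-in data so the sketch runs end-to-end; real experiments would use
# the dataset images resized/cropped to 224 x 224.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=4)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

model.train()
for epoch in range(10):                              # assumed number of epochs
    for batch_images, batch_labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()
```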


Table 6
Recognition rates (%) obtained using the multiple kernel learning approach on the five scaled datasets for the χ² distance function. Images are scaled with scaling factors taken randomly from the set {0.5, 0.75, 1.25, 1.5, 1.75}.

Databases                                        Condition   CH+ZMs   CH+GZMs   CH+MZMs   CH+ZMs+GZMs   CH+GZMs+MZMs
PASCAL VOC 2005 (Shape dominance)                Normal      75.76    78.96     76.20     79.04         79.54
                                                 Scaled      75.44    78.86     75.76     78.86         79.44
Soccer (Color dominance)                         Normal      64.29    66.43     62.14     67.86         67.14
                                                 Scaled      63.86    65.00     62.00     66.57         66.86
SIMPLIcity (Color dominance)                     Normal      84.80    83.00     84.00     84.20         85.20
                                                 Scaled      84.80    82.80     83.80     84.20         85.00
Flower (Color and shape parity)                  Normal      64.12    67.50     64.41     66.03         66.32
                                                 Scaled      64.12    66.50     63.85     65.85         65.59
Caltech-101 (Shape and color co-interference)    Normal      79.00    81.40     79.60     84.60         84.60
                                                 Scaled      78.80    81.00     78.40     84.20         84.40

Table 7
Comparison of recognition rates (%) obtained by the CNN-based AlexNet and the proposed fusion-based methods on the five datasets under the normal, rotation and scale conditions.

Datasets                                         Normal               Rotation             Scale
                                                 AlexNet   Proposed   AlexNet   Proposed   AlexNet   Proposed
PASCAL VOC 2005 (Shape dominance)                97.61     79.54      85.36     78.66      95.44     79.44
Soccer (Color dominance)                         85.43     67.86      66.86     67.14      84.14     66.86
SIMPLIcity (Color dominance)                     95.40     85.20      53.20     85.00      93.20     85.00
Flower (Color and shape parity)                  90.59     67.50      65.59     66.44      87.56     66.50
Caltech-101 (Shape and color co-interference)    99.20     84.60      51.40     84.00      96.80     84.40

It is observed that the recognition rates achieved by AlexNet are significantly higher under the normal and scaling conditions. Its performance, however, decreases significantly under rotation; the drop is as high as 42.20% on the SIMPLIcity dataset, from 95.40% to 53.20%. This observation is in conformity with the weaknesses of the CNNs approach reported in [14]. The proposed descriptors, on the other hand, perform significantly better under rotation than AlexNet on all datasets except the PASCAL VOC 2005 dataset, where AlexNet still provides the better recognition rate.

5.2.5. Time complexity analysis

The time complexity analysis consists of three parts: the time taken for feature extraction, the computation of the kernel matrix, and the MKL training. Assuming the size of the training dataset is T, the features of all T images must be computed for the training session. The feature extraction time is different for the various descriptors. Let the size of the feature vector be L; then the kernel matrix k(x, x′) requires time of the order O(T²L). To be exact, T(T − 1)/2 distance values must be computed, followed by the same number of operations for normalizing them by the average of all pairwise distances. Each distance computation requires O(L) time. Therefore, if there are F modalities, the computation complexity for all kernels is O(T²LF). The MKL training also takes considerable time, as it depends on the number of images T and the number of classes K. The prediction phase goes through the same steps of feature extraction and computation of the kernel matrix k(x, x′), but these two steps are required only for the single image whose class is to be predicted. Thus, the size of the kernel matrix for the prediction phase is 1 × T. The prediction also depends on the number of classes, as it computes the inner product of the kernel vector with the support vectors of each class and returns the class with the minimum value of the inner product. The MKL takes much longer than the SVM for both training and prediction.

The time taken for the CH features is the least, as their complexity is O(MN), where M × N is the size of the image. The task of feature extraction for the ZMs, MZMs, GZMs and QZMs involves the computation of the ZMs of either a monochrome image (ZMs and GZMs) or each of the three component images (MZMs and QZMs). The GZMs need the gradient magnitude image to be computed before its ZMs are derived. The time complexity for deriving the gradient image is O(MN). The time complexity for deriving the ZMs of a component image is O(MN·pmax³) if all moments up to a maximum order pmax are derived. However, there exist fast methods which reduce the time complexity to O(MN·pmax²) [36]. Further, with the use of the 8-way symmetry/anti-symmetry, the computation time is reduced to approximately one-eighth of that of its non-symmetric form.

Table 8(a) shows the average time taken for the feature extraction of one image of the Flower database, and Table 8(b) shows the training time for T images with K classes for the five databases, along with the prediction time for classifying one image using the MKL. For a method using a single cue, the MKL reduces to the SVM classifier. The feature extraction times are given for the single descriptors (CH, ZMs, GZMs, MZMs, RQZMs), for the combinations of two descriptors (CH+ZMs, CH+GZMs, CH+MZMs), and for the combinations of three descriptors (CH+GZMs+ZMs, CH+GZMs+MZMs).
Since the feature extraction time depends on the size of the image, not on its contents, we choose an image of size 750 × 500 from the Flower database, whose images are the largest among those of all the databases.
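The GZM extraction step whose cost is discussed above can be summarised with a short sketch. The code below is a simplified, direct (non-fast) implementation assuming a Sobel gradient magnitude of the intensity image and a crude discrete approximation of the Zernike integral; the paper's exact color-gradient construction, the fast symmetry-based ZM computation [36], and the reported feature counts of Table 8(a) are not reproduced here.

```python
import numpy as np
from math import factorial
from scipy import ndimage

def zernike_radial(p, q, rho):
    """Radial polynomial R_pq(rho) for p >= |q| with p - |q| even."""
    q = abs(q)
    R = np.zeros_like(rho)
    for s in range((p - q) // 2 + 1):
        c = ((-1) ** s * factorial(p - s)
             / (factorial(s) * factorial((p + q) // 2 - s) * factorial((p - q) // 2 - s)))
        R += c * rho ** (p - 2 * s)
    return R

def zernike_magnitudes(image, p_max):
    """Magnitudes |Z_pq| of the Zernike moments of a 2-D image mapped to the unit disk."""
    N, M = image.shape
    X, Y = np.meshgrid(np.linspace(-1, 1, M), np.linspace(-1, 1, N))
    rho, theta = np.sqrt(X ** 2 + Y ** 2), np.arctan2(Y, X)
    mask = rho <= 1.0
    f = image.astype(float) * mask
    dA = (2.0 / M) * (2.0 / N)              # crude area element of the discrete approximation
    feats = []
    for p in range(p_max + 1):
        for q in range(0, p + 1):           # |q| and -|q| give equal magnitudes, keep q >= 0
            if (p - q) % 2:
                continue
            V_conj = zernike_radial(p, q, rho) * np.exp(-1j * q * theta) * mask
            Z = (p + 1) / np.pi * np.sum(f * V_conj) * dA
            feats.append(np.abs(Z))
    return np.asarray(feats)

def gzm_features(gray_image, p_max=13):
    """GZMs sketch: Zernike moment magnitudes of the gradient-magnitude image."""
    gx = ndimage.sobel(gray_image.astype(float), axis=1)
    gy = ndimage.sobel(gray_image.astype(float), axis=0)
    return zernike_magnitudes(np.hypot(gx, gy), p_max)
```

For the plain ZMs the same `zernike_magnitudes` routine would be applied directly to the intensity image, and for the multi-channel variants to each color component.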


Table 8(a)
Time taken (in seconds) by various descriptors for feature extraction of one image of the Flower dataset of size 750 × 500.

Descriptors       CH     ZMs    GZMs   MZMs   RQZMs  CH+ZMs  CH+GZMs  CH+MZMs  CH+GZMs+ZMs  CH+GZMs+MZMs
No. of features   64     112    112    120    160    176     176      184      288          296
Time taken (s)    0.007  0.048  0.076  0.042  0.044  0.055   0.083    0.049    0.131        0.125

Table 8(b)
Training time of T images with K classes on the five databases and prediction time for one image for the MKL classification.

Databases          Training     Classes   No. of kernels   Training time (s)                Prediction time (s)
                   images (T)   (K)       (no. of cues)    Kernel matrix   Weight           for one image
                                                           computation     computation
PASCAL VOC 2005    684          4         1                2.097           0.6566           0.0048
                                          2                4.194           1.1456           0.0083
                                          3                6.291           1.6034           0.0125
Soccer             140          7         1                0.089           0.0783           0.0042
                                          2                0.178           0.1612           0.0080
                                          3                0.267           0.2059           0.0105
SIMPLIcity         500          10        1                1.181           0.4921           0.0213
                                          2                2.362           0.7651           0.0427
                                          3                3.543           1.1859           0.0631
Flower             680          17        1                1.420           0.9913           0.0860
                                          2                2.840           1.8068           0.1704
                                          3                4.260           2.6914           0.2565
Caltech-101        500          10        1                1.183           0.4370           0.0220
                                          2                2.365           0.8138           0.0428
                                          3                3.549           1.1421           0.0633

The maximum order of moments for the ZMs and GZMs is pmax = 13, and for the MZMs and QZMs it is pmax = 7. The number of histogram bins for the CH features is 64 (4 equal-sized bins each for R, G, and B, i.e., 4³ = 64). It is observed in Table 8(a) that the feature extraction time for the CH features is the least, 0.007 s, and that the maximum time among the single cues, 0.076 s, is taken by the GZMs. The times taken for the multiple cues are the aggregates of those of their individual cues. Similarly, the time taken for T training images is T times the time taken for one image. Thus, the minimum time taken for the 680 training images of the Flower dataset is 4.760 s for the CH features and the maximum is 51.680 s for the GZMs features. Table 8(b) shows that the training time for the MKL classification phase is very high compared to the time taken for the prediction phase; it depends not only on the number of training images but also on the number of classes. The training time is further split into two parts: kernel matrix computation and weight computation. Both are significant compared to the prediction time.

6. Conclusion

We have proposed a fusion of geometrically invariant color, shape and texture descriptors using the MKL approach. First, we introduce an effective texture descriptor, obtained by deriving the ZMs of the magnitude of the gradient of the color images and referred to as the GZMs. The color features are represented by the color histograms (CH). The ZMs, MZMs and RQZMs represent the shape features. The five sets of features representing color (CH), shape (ZMs, MZMs, RQZMs) and texture (GZMs) are geometrically invariant. The three shape descriptors ZMs, MZMs, and RQZMs are derived in different ways from color images: the ZMs are derived from the intensity component of a color image, whereas the MZMs and RQZMs represent the early fusion of the color and shape features and are derived, respectively, using the multi-channel theory of moments and the quaternion form of the moments. The color, shape and texture features are fused in various combinations, and the multiple kernel learning (MKL) technique is employed for classification to enhance the recognition rates of these descriptors. Detailed experimental analysis has been performed on the PASCAL VOC 2005, Soccer, SIMPLIcity, Flower and Caltech-101 databases. The following conclusions are made.

1. The proposed texture descriptor GZMs is a very effective descriptor which provides better recognition rates than the shape descriptor (ZMs). Its performance is even on par with the MZMs, which represent the early fusion of the color and shape features, on the Caltech-101 dataset, a dataset which possesses color and shape co-interference.

2. The color descriptor is observed to provide the highest recognition rate across all datasets, which is a well-established fact in color object recognition problems.

3. The fusion of color, shape and texture using the MKL classifier is very effective across all datasets except those which are highly dominated by color. In general, all databases (or all objects) are not likely to be color dominated. Therefore, the proposed approach is very effective for color object recognition under the geometric invariance condition.


4. Among the three combinations of two cues – CH+ZMs, CH+GZMs, and CH+MZMs – the combination CH+GZMs provides the overall best recognition rates under all conditions considered here. In fact, the results of CH+GZMs are competitive with the results of the two combinations of three cues, CH+ZMs+GZMs and CH+GZMs+MZMs.

Acknowledgments

The authors are grateful to the University Grants Commission (UGC), New Delhi, India, for providing financial grants for the Major Research Project entitled "Development of Efficient Techniques for Feature Extraction and Classification for Invariant Pattern Matching and Computer Vision Applications", vide its File No.: 43-275/2014(SR).

References

[1] H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in: Proceedings of the European Conference on Computer Vision, 2006, pp. 404–417.
[2] S.S. Bucak, R. Jin, A.K. Jain, Multiple kernel learning for visual object recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014) 1354–1369.
[3] C. Chang, C. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) 2 (2013) 1–39.
[4] B.J. Chen, H.Z. Shu, H. Zhang, G. Chen, C. Toumoulin, J.L. Dillenseger, L.M. Luo, Quaternion Zernike moments and their invariants for color image analysis and object recognition, Signal Process. 92 (2012) 308–318.
[5] B. Chen, H. Shu, G. Coatrieux, G. Chen, X. Sun, J.L. Coatrieux, Color image analysis by quaternion-type moments, J. Math. Imaging Vis. 51 (2015) 124–144.
[6] Z. Chen, S. Sun, A Zernike moment phase-based descriptor for local image representation and matching, IEEE Trans. Image Process. 19 (2010) 205–219.
[7] C. Chong, P. Raveendran, R. Mukundan, The scale invariants of pseudo-Zernike moments, Pattern Anal. Appl. 6 (2003) 176–184, doi:10.1007/s10044-002-0183-5.
[8] S.R. Dubey, S.K. Singh, R.K. Singh, Multichannel decoded local binary patterns for content-based image retrieval, IEEE Trans. Image Process. 25 (2016) 4018–4032.
[9] P. Gehler, S. Nowozin, On feature combination for multiclass object classification, in: Proceedings of the International Conference on Computer Vision, 2009, pp. 221–228.
[10] R.C. Gonzalez, R.E. Woods, Digital Image Processing, 3rd ed., Prentice Hall, 2008.
[11] L. Guo, M. Dai, M. Zhu, Quaternion moment and its invariants for color object classification, Inf. Sci. (NY) 273 (2014) 132–143.
[12] A.K. Jain, A. Vailaya, Image retrieval using color and shape, Pattern Recognit. 29 (1996) 1233–1244.
[13] S. Jeannin (Ed.), MPEG-7 Visual Part of Experimentation Model Version 5.0, ISO/IEC JTC1/SC29/WG11/N3321, Nordwijkerhout, March 2000.
[14] H. Kandi, A. Jain, S. Velluva Chathoth, D. Mishra, G.R.K.S. Subrahmanyam, Incorporating rotational invariance in convolutional neural network architecture, Pattern Anal. Appl. (2018) 1–14, doi:10.1007/s10044-018-0689-0.
[15] E.G. Karakasis, G.A. Papakostas, D.E. Koulouriotis, V.D. Tourassis, A unified methodology for computing accurate quaternion color moments and moment invariants, IEEE Trans. Image Process. 23 (2014) 596–611.
[16] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1–9, doi:10.1016/j.protcy.2014.09.007.
[17] R. Lan, Y. Zhou, Y.Y. Tang, Quaternionic local ranking binary pattern: a local descriptor of color images, IEEE Trans. Image Process. 25 (2016) 566–579.
[18] G. Lavoue, A. Baskurt, Improving Zernike moments comparison for optimal similarity and rotation angle retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 627–636.
[19] S.H. Lee, J.Y. Choi, Y.M. Ro, K.N. Plataniotis, Local color vector binary patterns from multichannel face images for face recognition, IEEE Trans. Image Process. 21 (2012) 2347–2353.
[20] J. Li, N. Sang, C. Gao, Completed local similarity pattern for color image recognition, Neurocomputing 182 (2016) 111–117.
[21] S. Li, M.C. Lee, C.M. Pun, Complex Zernike moments features for shape-based image retrieval, IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39 (2009) 227–237.
[22] G.H. Liu, J.Y. Yang, Content-based image retrieval using color difference histogram, Pattern Recognit. 46 (2013) 188–198.
[23] P. Liu, J.M. Guo, K. Chamnongthai, H. Prasetyo, Fusion of color histogram and LBP-based features for texture image retrieval and classification, Inf. Sci. (NY) 390 (2017) 95–111.
[24] D.G. Lowe, Distinctive image features from scale invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[25] T. Maenpaa, M. Pietikainen, J. Viertola, Separating color and pattern information for color texture discrimination, in: Proceedings of the 16th International Conference on Pattern Recognition, 1, 2002, pp. 668–671.
[26] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 971–987.
[27] L. Piras, G. Giacinto, Information fusion in content based image retrieval: a comprehensive overview, Inf. Fusion 37 (2017) 50–60.
[28] Z. Shao, H. Shu, J. Wu, B. Chen, J. Louis, Quaternion Bessel–Fourier moments and their invariant descriptors for object reconstruction and recognition, Pattern Recognit. 47 (2014) 603–611.
[29] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015, pp. 1–14, doi:10.1016/j.infsof.2008.09.005.
[30] C. Singh, A. Aggarwal, A noise resistant image matching method using angular radial transform, Digit. Signal Process. 33 (2014) 116–124.
[31] C. Singh, J. Singh, Multi-channel versus quaternion orthogonal rotation invariant moments for color image representation, Digit. Signal Process. 78 (2018) 376–392.
[32] C. Singh, J. Singh, Quaternion generalized Chebyshev–Fourier and pseudo-Jacobi–Fourier moments for color object recognition, Opt. Laser Technol. 106 (2018) 234–250.
[33] C. Singh, R. Upneja, Error analysis in the computation of orthogonal rotation invariant moments, J. Math. Imaging Vis. 49 (2014) 251–271.
[34] C. Singh, E. Walia, K.P. Kaur, Color texture description with novel local binary patterns for effective image retrieval, Pattern Recognit. 76 (2018) 50–68.
[35] C. Singh, E. Walia, N. Mittal, Rotation invariant complex Zernike moments features and their applications to human face and character recognition, IET Comput. Vis. 5 (2011) 255.
[36] C. Singh, E. Walia, R. Upneja, Accurate calculation of Zernike moments, Inf. Sci. (NY) 233 (2013) 255–275.
[37] P. Srivastava, A. Khare, Integration of wavelet transform, local binary patterns and moments for content-based image retrieval, J. Vis. Commun. Image Represent. 42 (2017) 78–103.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9, doi:10.1109/CVPR.2015.7298594.
[39] K. Van De Sande, T. Gevers, C. Snoek, Evaluating color descriptors for object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1582–1596.
[40] S. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, Multiple kernel learning and the SMO algorithm, Adv. Neural Inf. Process. Syst. (2010) 9, doi:10.1145/1015330.1015424.
[41] C. Wang, X. Wang, Y. Li, Z. Xia, C. Zhang, Quaternion polar harmonic Fourier moments for color images, Inf. Sci. (NY) 450 (2018) 141–156.


[42] J.Z. Wang, J. Li, G. Wiederhold, SIMPLIcity: semantics-sensitive integrated matching for picture libraries, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 947–963.
[43] C.-Y. Wee, R. Paramesran, On the computational aspects of Zernike moments, Image Vis. Comput. 25 (2007) 967–980.
[44] W. Xiang-yang, L. Wei-yi, Y. Hong-ying, N. Pan-pan, L. Yong-wei, Invariant quaternion radial harmonic Fourier moments for color image retrieval, Opt. Laser Technol. 66 (2015) 78–88.
[45] B. Xiao, J.F. Ma, X. Wang, Image analysis by Bessel–Fourier moments, Pattern Recognit. 43 (2010) 2620–2629.
[46] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Computer Vision – ECCV, 2014, pp. 818–833.
[47] H. Zhang, H.Z. Shu, P. Haigron, B.S. Li, L.M. Luo, Construction of a complete set of orthogonal Fourier–Mellin moment invariants for pattern recognition applications, Image Vis. Comput. 28 (2010) 38–44.
[48] C. Zhu, C.E. Bichot, L. Chen, Image region description using orthogonal combination of local binary patterns enhanced with color information, Pattern Recognit. 46 (2013) 1949–1963.
[49] The Caltech-101 object category dataset at http://www.vision.caltech.edu/ImageDatasets/Caltech101/.
[50] The Flower dataset at http://www.robots.ox.ac.uk/~vgg/data/flowers/.
[51] The PASCAL VOC 2005 dataset at http://host.robots.ox.ac.uk/pascal/VOC/databases.html.
[52] The Soccer dataset at http://lear.inrialpes.fr/people/vandeweijer/soccer/soccerdata.tar.