Applied Soft Computing Journal 81 (2019) 105474
A new axiomatic methodology for the image similarity

Alessia Amelio
DIMES, University of Calabria, Via P. Bucci 44, 87036 Rende (CS), Italy
Highlights

• A new framework with two axiomatic conditions is introduced for image similarity.
• A similarity measure is derived based on approximate matching of image patches.
• The similarity measure is compared with other CBIR methods.
• The experiment is performed on benchmark datasets.
• Obtained retrieval performances are very promising.
Article info

Article history:
Received 14 September 2018
Received in revised form 15 March 2019
Accepted 29 April 2019
Available online 15 May 2019

Keywords:
Similarity measure
Pattern matching
Hamming distance
Information theory
Abstract

In this paper, a new theoretical framework is introduced which explores the notion of image similarity through two axiomatic conditions. The first one resembles the concept of frequency in matching image patches. The second one formalizes the high-level strategy for computing the image similarity. A realization of this theoretical framework is a new image similarity measure based on approximate matching of patches shared between the images. The approximate match is defined in terms of a normalized 2D Hamming distance, and verified within a given neighbourhood of the position of the patches in the images. The similarity measure is computed as the average area of these shared patches. The proposed approach is tested on well-known benchmark datasets. The obtained results show that the proposed method outperforms, in terms of retrieval precision and computational complexity, other competing measures adopted in image retrieval.
1. Introduction

In the last decades, Content-Based Image Retrieval (CBIR) has received particular attention due to the spread of images and photos shared and browsed on the Web. Accordingly, the need to search images is rapidly growing because of the ubiquitous access to images from the Internet [1]. Recently, CBIR has been applied in multiple domains, including security systems, i.e. biometrics [2,3], healthcare systems, to speed up the diagnosis process [4,5], and satellite imaging systems, to provide concrete support to disaster prevention and Earth monitoring [6,7]. Basically, the aim of CBIR is to retrieve a set of images relevant to a given query image from a large-scale corpus of images [1]. The relevance mainly depends on the semantic class to which the query image belongs. The two key aspects of a CBIR system are: (i) the features representing the image content, and (ii) the (dis)similarity measure adopted for comparing the query image with the corpus of images. According to these two aspects, a CBIR system can be more or less accurate in retrieving the relevant images. In particular, each image in the corpus is represented
by a feature vector or descriptor. Retrieval of relevant images is performed by measuring the (dis)similarity between the feature vector of the query image and the feature representations of the images in the corpus. Different approaches have been introduced in the literature for feature extraction and (dis)similarity computation in CBIR systems. Basically, the features have been categorized into: (i) global features, and (ii) local features [1]. Global features aim to represent the whole image content in a single representation embedding colour, shape and texture contents [8–10]. By contrast, local features intend to describe local image regions centred at a given point of interest, possibly using multiple descriptors [11,12]. The (dis)similarity measures have been broadly classified as [13]: (i) geometric measures, (ii) information theoretic measures, and (iii) statistical measures. Geometric measures correspond to traditional distance functions applied between image feature vectors representing global or local information [14]. Information theoretic measures compare the pixel intensity distributions in terms of image histograms to compute the similarity between images [15]. Statistical measures compute the correlation between probability distributions of image pixels for similarity evaluation [16].
Recently, local patch-based measures for image (dis)similarity evaluation have proven particularly interesting in the CBIR scenario, for their ability to capture contextual information from the image. They consist of detecting local patches in the first image, extracting features from the patches, and verifying their correspondence inside the second image according to a given (dis)similarity criterion [11,17]. The features can be of different types, including raw pixel intensity or colour information. In this context, the Average Common Submatrix (ACSM) method computes the similarity between two images as the average area of the largest square patches exactly matching in the two images at raw pixel level [18,19]. Also, its approximate version, named Approximate Average Common Submatrix (A-ACSM), employs an approximate matching strategy omitting a portion of pixels at fixed row and column offsets during the patch match [20]. The main limitation of ACSM is the computational complexity of the brute-force algorithm: in the worst case it can be polynomial with exponent up to five in the size of the images on which the similarity is computed. Although a solution based on a generalized suffix tree has been introduced for indexing the square image patches [19], the construction of this index requires at least quadratic time and quadratic space in the size of the images. Another limitation is connected with the high number of different values of the image pixels, which may compromise the similarity computation because of the exact match used by ACSM: it may result in small similarity values even when the actual variation between the images is small. Although A-ACSM tries to overcome this limitation, it only considers a portion of the elements in the patches for the approximate match, which may result in information loss during the similarity computation. In this paper, the aforementioned limitations are overcome by presenting a new theoretical framework based on two axiomatic conditions which formalize and extend the key concepts of ACSM-like similarity measures. This represents an extension of the similarity approach presented in [21]. The first axiomatic condition introduces the concept of frequency in matching the image patches, which is a brand new idea. The second axiomatic condition resembles the general notion of ACSM similarity. Hence, a new approximate similarity measure is introduced following the two axiomatic conditions, which provides three main contributions: (i) a new approximate matching strategy which does not omit any pixel during the match, (ii) a new strategy for searching the image patches in a neighbourhood of their position, which noticeably reduces the time cost without indexing methods, and (iii) a normalized extension of the Hamming distance for the comparison of image patches. To the best of our knowledge, no prior work has provided a formalization of the normalized Hamming distance in two dimensions; hence, such a normalized two-dimensional version of the distance measure had not been available so far. Experiments performed on well-known benchmark datasets show very promising results of the new similarity measure in CBIR. In particular, the proposed similarity approach proves its competitiveness in terms of computational complexity and performance with respect to ACSM, A-ACSM, and other well-known local patch-based methods widely adopted in CBIR. The paper is organized as follows. Section 2 describes the related work in CBIR. Section 3 presents the new axiomatic framework for image similarity.
Section 4 introduces the new image similarity measure, i.e. the algorithm, its main features and an application example. Section 5 describes the experimental setting, the evaluation methodology and the obtained results. Section 6 provides a discussion of the obtained results. Finally, Section 7 draws conclusions and future work directions.
2. Related works in content based image retrieval

The feature representations used in CBIR can be partitioned into global and local features. Both approaches have their advantages and disadvantages; which one is better suited clearly depends on the problem being solved. Also, it is worth considering the computational cost of each approach.

Global features. Wengert et al. [8] introduced a colour signature, called bag-of-colours, which was based on the generation of an image colour histogram starting from a specifically designed colour codebook. Furthermore, the histogram was normalized and the power-law method was applied for regularizing the contribution of each colour in the final descriptor. Nazir et al. [22] presented a new CBIR approach to merge the colour and texture feature representations. Specifically, the colour content is provided by the Colour Histogram (CH), while the textural content is extracted by the Discrete Wavelet Transform (DWT) and Edge Histogram Descriptor (EHD). Ali et al. [23] proposed two new image representation approaches based on the histograms of triangles, aiming to add spatial information to the bag-of-features representation. The input image is split into two and four triangles which are separately considered in order to compute the histograms of triangles at two levels. Also, the same author in [24] presented a new method for embedding relative spatial information in the histogram of the bag-of-features, based on the evaluation of the global geometric relationship between pairs of equal visual words and their spatial distribution, considering a centroid in the image as the reference point. In [25], this geometric relationship was calculated between triplets of equal visual words via computation of an orthogonal vector in correspondence of each point in these triplets. The magnitude of the orthogonal vectors defines the histogram of visual words, which is robust to transformations, e.g. rotation. In [26], the spatial information in the histogram of the bag-of-features is extracted by computing the histogram of circles, whereas rotations are managed by employing concentric circles. The histogram is extracted from the image by adopting a weighted circle scheme. In [27], a Hybrid Geometric Spatial Image Representation (HGSIR) computed the bag-of-features representation by combining the histograms extracted from rectangular, triangular and circular sub-regions of the image in order to preserve the spatial information. Wang et al. [10] proposed a new image descriptor based on the integration of colour and texture contents. Colour was represented by pseudo-Zernike chromaticity distribution moments in opponent chromaticity space, while texture was represented by a rotation-invariant and scale-invariant descriptor in the steerable pyramid domain. Wang et al. [28] introduced a compact image descriptor for retrieving duplicate images in a large-scale image corpus. The descriptor was based on the generation of a k-bit hash code from the image content. Siagian and Itti [29] proposed a context-based scene recognition algorithm for mobile robotic applications based on the extraction of multiscale visual features from the scenes. The gist of the scene was captured and represented inside a low-dimensional feature vector. Wang and Hua [30] employed a new image search system where image descriptors were represented as colour maps.
From each colour map, a set of dominant colours was extracted for filtering the image search results. Local features. Lowe [11] introduced the SIFT descriptors for efficiently detecting local image features through a filtering procedure finding key points in the scale space. They were computed by modelling blurred image gradients in different orientations and scales by using Difference of Gaussians (DoG) as approximation of Laplacian of Gaussian (LoG). Bay et al. [12] proposed
the SURF descriptors, which approximated LoG with box filters computed using integral images. Non-maximum suppression was adopted for finding the key points, and wavelet responses were used for feature representation. Ali et al. [31] proposed a feature representation which is a visual word integration of SIFT and SURF, embedding the robustness to changes in scale and rotation from SIFT and the robustness to illumination changes from SURF. Calonder et al. [32] introduced the BRIEF binary descriptor, which extracted binary features from the local patches of interest. Basically, binary strings were generated from each local patch as feature point descriptors using intensity difference tests. Zhang et al. [33] proposed the USB descriptor, which extracted an ultra-short binary representation and a compact spatial feature from each key point detected in the image. It captured the visual characteristics and spatial configuration in the neighbourhood of the key point. Barnes et al. [17] introduced the raw pixel intensity of the local image patches as feature representation to perform approximate matching between the image patches. The different global and local descriptors require the application of some (dis)similarity measure for their comparison in a CBIR system. The state-of-the-art (dis)similarity measures can be broadly divided into geometric, information theoretic and statistical measures, according to their constituent characteristics.

Geometric measures. They include the $L_p$-normalized distance, i.e. the Minkowski distance, between two fixed-size image feature vectors [1]. It reduces to: (i) the $L_1$-distance, i.e. the Manhattan distance, when $p = 1$, (ii) the $L_2$-distance, i.e. the Euclidean distance, when $p = 2$, and (iii) the $L_\infty$-distance, i.e. the Chebyshev distance, when $p = \infty$. Different extensions and applications of these measures have been proposed in the literature. Jégou et al. [34] introduced the contextual dissimilarity measure, which considered the local distribution of the feature vectors and iteratively updated the distance according to Sinkhorn's scaling algorithm in order to change the neighbourhood composition. Kong et al. [35] proposed Manhattan hashing. It was based on the Manhattan distance computed between points in the hashing space, used for measuring the similarity between binary-code representations of the images. Xu and Zhang [36] introduced an image retrieval system using a combination of Local Binary Pattern image descriptors and the Euclidean distance for medical images. Amato et al. [37] measured the dissimilarity of local SIFT descriptors in two images as the average Euclidean distance between the local descriptors in the first image and their nearest neighbour descriptors in the second image.

Information theoretic measures. They are based on the concept of entropy as defined by Shannon [38], representing a measure of uncertainty or complexity of an image. The entropy quantifies the dispersion of the probability distribution of the image intensity levels. Accordingly, images with a more uniform distribution of intensity levels exhibit higher dispersion, and consequently higher entropy. By contrast, images with large peaks in the intensity level distribution exhibit lower dispersion, and consequently lower entropy. Tourassi and Harrawood [39] analysed eight information theoretic (dis)similarity measures which were employed in medical image retrieval systems.
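As a concrete illustration of the $L_p$ family recalled above, the following minimal Python sketch (an illustration, not code from the paper) computes the Minkowski distance and its three special cases between two feature vectors:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p (Minkowski) distance between two equal-size feature vectors."""
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))        # L_inf: Chebyshev distance
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([0.2, 0.5, 0.1])
y = np.array([0.4, 0.1, 0.3])
print(minkowski(x, y, 1))        # L1: Manhattan distance
print(minkowski(x, y, 2))        # L2: Euclidean distance
print(minkowski(x, y, np.inf))   # L_inf: Chebyshev distance
```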
They were: (i) joint entropy, (ii) conditional entropy, (iii) mutual information, (iv) normalized mutual information, (v) average Kullback–Leibler divergence, (vi) maximum Kullback–Leibler divergence, (vii) Jensen divergence and, (viii) arithmetic–geometric mean divergence. Zachary [40] introduced the maximum relative entropy measure, which was an extension of the Kullback–Leibler divergence satisfying the identity and symmetry properties, in order to compare image feature vectors of colour
histograms representing joint probability density functions. Murayama et al. [41] proposed a divergence similarity measure based on the Jensen–Shannon divergence which compared colour images in terms of similarity of colour distribution. Pushpalatha and Ananthanarayana [42] introduced an information theoretic measure for computing the similarity between multimedia documents of multimodal objects, which were represented as signal objects to capture the correlations between the contents of multimedia documents.

Statistical measures. They are based on the comparison of probability distributions of image pixels. Rahman et al. [43] introduced an image similarity method based on the distribution of joint feature vectors of colour and texture characteristics. From the feature distribution of training images, mean vectors and covariance matrices were computed under the assumption of a multivariate Gaussian distribution. Furthermore, they were used by statistical distance measures to minimize the retrieval error probability. Zhang and Lu [44] evaluated different measures for image (dis)similarity in CBIR, including the χ² measure. They found that the Manhattan distance and the χ² measure can outperform the other (dis)similarity measures in terms of retrieval accuracy and efficiency. Cho et al. [45] compared different (dis)similarity measures for CBIR of breast masses on ultrasound images, including the Pearson's correlation coefficient measure.

In the last years, different methods based on deep learning have also been introduced for learning feature representations in CBIR. In particular, Wang et al. [9] employed a combination of two different convolutional neural network models for learning the shape representations for sketch-based shape retrieval, one for the sketches and the other for the views. Tzelepi and Tefas [46] proposed a model retraining approach, which adopts a convolutional neural network model to generate the feature descriptors from the activations of the convolutional layers. It is followed by adaptation and network retraining for generating more compact feature representations in order to improve the retrieval efficiency and performance. Saritha et al. [47] introduced a deep learning approach based on Deep Belief Networks (DBN), which are trained on large amounts of data for the extraction of meaningful feature representations in CBIR. Qayyum et al. [48] proposed a convolutional neural network model for learning suitable feature representations in order to classify 2D slices of body parts from 3D medical images. Bai et al. [49] presented a method for improving the performance of the AlexNet deep convolutional neural network in feature extraction and similarity computation for large-scale image retrieval. Finally, Liu et al. [50] introduced a new method which merges two different deep convolutional feature representations, from an improved LeNet-5 architecture and AlexNet, to obtain better performance in CBIR.

3. Axiomatic conditions for image similarity

In the following, an axiomatic approach to the similarity of images is presented. It realizes a theoretical framework which defines a new notion of image similarity formalized by two axiomatic conditions. This framework builds on the axiomatic approach introduced by Jacob Ziv in [51] for the similarity of individual sequences, and revisits it to provide an important extension to image processing.
In the rest of the paper, it is assumed that an image of area I × I is represented by a matrix of the same area, and that patches are square matrices of size less than I, also called sub-matrices. Let X and Y be two square matrices of size $N \times N$ and $\hat{N} \times \hat{N}$, respectively, defined on the same alphabet Σ. It is assumed that X is compared with Y, where Y is the reference square matrix with respect to which the comparison is performed.
Consider the mapping of square sub-matrices $Y^d_{i,j}$, where $i, j$ is the (row, column) position in Y at which the sub-matrix occurs and $d$ is its size, to a set $S(N^2, Y)$ of $C(N^2, Y)$ features characterizing Y. Using these features, the similarity of Y to X is evaluated. Accordingly, this mapping can be expressed as:

$$Y^d_{i,j} \rightarrow \{0, \ldots, C(N^2, Y)\}, \qquad i, j, d = 1, \ldots, \hat{N}, \tag{1}$$
where 0 indicates that some sub-matrix of Y cannot be mapped to any feature. Given the feature set $S(N^2, Y)$ which characterizes Y, and the matrix X, let $0 \le p_1 \le 1$ be the fraction of positions $i, j$ in X such that there exists an element of $S(N^2, Y)$ which is a function of the sub-matrix $X^d_{i,j}$. As multiple positions $i, j$ in X can be associated with the same feature in $S(N^2, Y)$, let $0 \le p_2 \le 1$ be the fraction of distinct elements of $S(N^2, Y)$ among the $N^2 p_1$ positions of X.

Axiomatic Condition 1. The matrix X must be considered as dissimilar to Y if $\min\{p_1, p_2\} < p_0$, where $p_0$ is a threshold parameter. Also, if X = Y, the two matrices must be considered as similar $\forall\, 0 \le p_0 \le 1$.

The Axiomatic Condition 1 requires the number of positions $i, j$ in X with a correspondence in the feature set $S(N^2, Y)$ to grow linearly with the area of the matrix X. Hence, if $p_1$ is small, so that X contains $N^2 p_1$ positions having a correspondence in the set $S(N^2, Y)$ and the remaining $N^2(1 - p_1)$ positions with no correspondence in the set $S(N^2, Y)$, X must be considered as dissimilar to Y. On the other hand, if $p_1$ is high but $p_2$ is low, only a small subset of distinct features in $S(N^2, Y)$ has a correspondence with the $N^2 p_1$ positions in X. Consequently, X must be considered as dissimilar to Y.

Example 3.1. Fig. 1 shows an application example of the Axiomatic Condition 1. The feature set $S(N^2, Y)$ is characterized by 12 elements, hence $C(N^2, Y) = 12$. The two square sub-matrices $Y^1_{1,1}$ and $Y^1_{1,3}$ in Y are mapped to the features $s_1$ and $s_3$. Also, the feature $s_1$ is a function of the sub-matrix $X^1_{1,1}$, whereas the feature $s_3$ is a function of the sub-matrices $X^1_{1,2}$ and $X^1_{1,3}$. It is worth noting that $p_1$ is computed as the number of positions $i, j$ in X with a correspondence in $S(N^2, Y)$, which is 3, divided by the total number of positions in X, which is 16. Also, $p_2$ is computed as the number of distinct features in $S(N^2, Y)$, which is 2 ($s_1$ and $s_3$), divided by the number of positions $i, j$ in X with a correspondence in $S(N^2, Y)$, which is 3 (positions 1,1, 1,2 and 1,3). The value of $p_0$ is set to 0.5. Consequently, X and Y are considered as dissimilar: although 2 out of the 3 matched positions in X involve distinct features ($p_2 = 2/3 = 0.67$), the fraction of positions $i, j$ in X with a correspondence in $S(N^2, Y)$ is too small ($p_1 = 3/16 = 0.19$).

The Axiomatic Condition 1 is used as a criterion for evaluating the matching of two square matrices based on the frequency of the distinct features which are in common between the two matrices [51]. This defines the fine-grained criterion on which the similarity between two images is formalized according to the Axiomatic Condition 2; a code sketch of the decision rule follows Example 3.2 below.

Let A, B, and C be square matrices of size $M \times M$, $\hat{M} \times \hat{M}$ and $P \times P$, respectively, defined on the same alphabet Σ.

Axiomatic Condition 2. A is considered as more similar to B than to C if the average area of the square sub-matrices in A matching inside B according to the Axiomatic Condition 1 is larger than the same average area between A and C.

Example 3.2. Fig. 2 shows that A can be considered as more similar to B than to C, since it has on average larger sub-matrices in common with B than with C.
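To make the decision rule of the Axiomatic Condition 1 concrete, the following minimal Python sketch (illustrative names, not from the paper) replays the numbers of Example 3.1:

```python
def condition1_dissimilar(n_matched, n_positions, n_distinct, p0):
    """Decision rule of Axiomatic Condition 1.

    n_matched   : positions i, j of X with a correspondence in S(N^2, Y)
    n_positions : total number of positions of X (its area N^2)
    n_distinct  : distinct features of S(N^2, Y) among the matched positions
    """
    p1 = n_matched / n_positions   # fraction of matched positions of X
    p2 = n_distinct / n_matched    # fraction of distinct matched features
    return min(p1, p2) < p0        # True -> X must be considered dissimilar to Y

# Example 3.1: 3 matched positions out of 16, 2 distinct features (s1, s3), p0 = 0.5.
print(condition1_dissimilar(3, 16, 2, 0.5))   # True, since p1 = 0.19 < 0.5
```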
It is worth noting that no assumption is made about the portion of B or C where the square sub-matrices of A are matched. Also, if the match between the square sub-matrices were defined regardless of the Axiomatic Condition 1, and the match were verified within the whole of B, this would be exactly the criterion introduced in ACSM for the image similarity evaluation [18,19].

4. The E-average common submatrix

In the following, a realization of the theoretical framework featured by the two axiomatic conditions is presented, which is the E-Average Common Submatrix image similarity measure (E-ACSM) [21]. It is characterized by three main contributions: (i) the area where the square sub-matrices are matched is reduced to a neighbourhood of extension ϵ in the matrix (ϵ-neighbourhood), where ϵ is an input parameter, (ii) the match of square sub-matrices is approximated according to a threshold value τ which considers the contribution of all elements in the sub-matrices, and (iii) the approximate matching resorts to the notion of Hamming distance between the square sub-matrices. In these three aspects, E-ACSM completely differs from the baseline formulation of ACSM given in [18,19].

Let A and B be two square matrices of size $M \times M$ and $\hat{M} \times \hat{M}$, respectively, defined on the same alphabet Σ. For each position $i, j$ in A, E-ACSM finds the largest square sub-matrix in A matching a square sub-matrix at some position $h, k$ inside a given neighbourhood of extension ϵ in B. The areas of these largest square sub-matrices are summed. At the end, the similarity between A and B is computed as the average of this sum over the area of A. This concept recalls the Axiomatic Condition 2. The E-ACSM similarity measure is defined as follows:

$$S_\alpha(A, B) = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} W(i, j), \qquad W(i, j) \ge \alpha. \tag{2}$$
The parameter α sets the minimum area of the square sub-matrices to consider in the similarity computation, and $W(i, j)$ is the area of the largest square sub-matrix at position $i, j$ in A matching a square sub-matrix at some position $h, k$ inside a given neighbourhood of extension ϵ in B. The matching is performed according to the Axiomatic Condition 1.

4.1. The algorithm

Let $A^d_{i,j}$ be the square sub-matrix with bottom-right corner at position $i, j$ in A and area $d \times d$. The procedure for computing the E-ACSM similarity measure is reported in Algorithm 1. The algorithm scans the matrix A (steps 2–4). For each position $i, j$ in A, it searches for the largest square sub-matrix of area greater than or equal to α matching inside a given neighbourhood of extension ϵ from $i, j$ in B (steps 5–12). First, it verifies the match of the square sub-matrix of maximal area $\min\{i, j\}^2$ (step 5), provided that its area is greater than or equal to α (steps 6–7). If the match is found, the area of this square sub-matrix is considered and the search is stopped (step 8). Otherwise, the match of increasingly smaller square sub-matrices at $i, j$ is verified until the match is found or the area of the square sub-matrix becomes smaller than α (step 10). If no match is found at $i, j$, the contribution of that position to the similarity measure is zero (steps 13–15). The areas of the largest square sub-matrices at the different positions in A are progressively summed (step 16). In the end, this sum is averaged over the area of A (step 20), and returned as the E-ACSM similarity measure (step 21). For a better understanding of the approach, Fig. 3 shows the flowchart of Algorithm 1.
Fig. 1. Application example of Axiomatic Condition 1.
Fig. 2. Application example of Axiomatic Condition 2. The sub-matrices matching in A and B are blue coloured, whereas those matching in A and C are green coloured.
Fig. 3. Flowchart of Algorithm 1.
Algorithm 1 The E-ACSM similarity measure
Require: A, B, α, ϵ
1:  Wtot := 0, d := 0, i := 1
2:  while i ≤ M do
3:      j := 1
4:      while j ≤ M do
5:          W(i, j) := 0, d := min{i, j}, flag := true
6:          while d ≥ √α and flag do
7:              if HM(A^d_{i,j}, B, i, j, ϵ) then
8:                  flag := false
9:              else
10:                 d := d − 1
11:             end if
12:         end while
13:         if ¬flag then
14:             W(i, j) := d²
15:         end if
16:         Wtot := Wtot + W(i, j), j := j + 1
17:     end while
18:     i := i + 1
19: end while
20: Sα(A, B) := Wtot / M²
21: return Sα(A, B)
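For readers who prefer an executable form, here is a minimal Python sketch of Algorithm 1 under stated assumptions: images are square numpy integer arrays, indices follow the paper's 1-based convention for bottom-right corners, and the matching procedure HM is realized with the normalized 2D Hamming distance of Section 4.1.2 thresholded by τ. The function names are illustrative, not from the paper.

```python
import numpy as np

def hamming_match(sub_a, B, i, j, eps, tau):
    """HM: True if sub_a matches, with similarity S = 1 - D_hat >= tau, an
    equally sized sub-matrix of B whose bottom-right corner (h, k) lies in
    the eps-neighbourhood of (i, j) (1-based indices)."""
    d = sub_a.shape[0]
    m_hat = B.shape[0]
    for h in range(max(i - eps, d), min(i + eps, m_hat) + 1):
        for k in range(max(j - eps, d), min(j + eps, m_hat) + 1):
            sub_b = B[h - d:h, k - d:k]
            d_hat = np.count_nonzero(sub_a != sub_b) / (d * d)
            if 1.0 - d_hat >= tau:
                return True
    return False

def e_acsm(A, B, alpha, eps, tau):
    """S_alpha(A, B) of Eq. (2), computed as in Algorithm 1."""
    M = A.shape[0]
    d_min = int(np.ceil(np.sqrt(alpha)))            # smallest admissible side
    w_tot = 0
    for i in range(1, M + 1):
        for j in range(1, M + 1):
            for d in range(min(i, j), d_min - 1, -1):   # largest sub-matrix first
                if hamming_match(A[i - d:i, j - d:j], B, i, j, eps, tau):
                    w_tot += d * d                  # W(i, j) = d^2
                    break                           # stop at the first match
    return w_tot / (M * M)
```

Note that, consistently with Theorem 4.2 below, the returned value is an average area and is therefore not bounded by 1.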
Fig. 5. Square sub-matrices at position 4,4 in A and the 1-neighbourhood from position 4,4 in B. Sub-matrices which can be matched inside the 1-neighbourhood are yellow coloured.
match is verified inside an ϵ-neighbourhood of $i, j$ in B. The ϵ parameter is firmly connected with the maximum area of the square sub-matrices which can be considered in the matching process. An ϵ-neighbourhood of $i, j$ is a square surface of area $(2\epsilon + 1) \times (2\epsilon + 1)$ where $i, j$ is the reference point and ϵ is a parameter. Accordingly, two different cases may occur:

1. $A^d_{i,j}$ is such that $d > (2\epsilon + 1)$. In this case, $A^d_{i,j}$ cannot have any match inside the ϵ-neighbourhood of $i, j$ in B.
2. $A^d_{i,j}$ is such that $d \le (2\epsilon + 1)$. In this case, the match of $A^d_{i,j}$ can be verified inside the ϵ-neighbourhood of $i, j$ in B.

If we assume that $(i + \epsilon \le \hat{M}) \wedge (i - \epsilon \ge 1) \wedge (j + \epsilon \le \hat{M}) \wedge (j - \epsilon \ge 1)$, the ϵ-neighbourhood of $i, j$ does not exceed the size of B on any side.
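The clipping of the neighbourhood at the borders of B can be made explicit with a small helper (an illustrative sketch; the function name is not from the paper), which reproduces the two situations of Examples 4.2 and 4.3 below:

```python
def eps_neighbourhood(i, j, eps, m_hat):
    """Row and column ranges (1-based, inclusive) of the eps-neighbourhood
    of the reference point (i, j), clipped to a m_hat x m_hat matrix B."""
    rows = (max(i - eps, 1), min(i + eps, m_hat))
    cols = (max(j - eps, 1), min(j + eps, m_hat))
    return rows, cols

print(eps_neighbourhood(4, 4, 1, 5))  # ((3, 5), (3, 5)): full 3 x 3 area, centred
print(eps_neighbourhood(4, 4, 2, 5))  # ((2, 5), (2, 5)): clipped on right/bottom
```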
Fig. 4. Extraction of square sub-matrices from position 3,4 in the matrix A to be matched inside the matrix B.
Example 4.1. Fig. 4 shows the extraction of increasingly smaller square sub-matrices from position 3,4 in A to be matched inside B. The square sub-matrix $A^3_{3,4}$ has the maximal area $\min\{3, 4\}^2 = 3^2 = 9$ at position 3,4. All other square sub-matrices ($A^2_{3,4}$ and $A^1_{3,4}$) start at the same position 3,4 but have a smaller area, of 4 and 1 respectively. If the α parameter is set to 4, the smallest sub-matrix $A^1_{3,4}$ cannot be considered in the matching process.

The core of Algorithm 1 is the matching procedure $HM(A^d_{i,j}, B, i, j, \epsilon)$, which evaluates the matching of the square sub-matrix $A^d_{i,j}$ inside a given neighbourhood of extension ϵ from $i, j$ inside B according to the Axiomatic Condition 1. The main points to be analysed in the matching procedure are: (i) the notion of ϵ-neighbourhood, and (ii) how to match a square sub-matrix inside the ϵ-neighbourhood.

4.1.1. Restricting the search space in terms of neighbourhood

The notion of ϵ-neighbourhood is based on the assumption that a positive match of square sub-matrices in the same neighbourhood in A and B is a sufficient condition for stating that A and B are similar. In this way, the match of a square sub-matrix $A^d_{i,j}$ of A is no longer verified inside the whole of B. By contrast, the
Example 4.2. Fig. 5 shows the square sub-matrices at position 4,4 in A (dotted lines and different colours) and the 1-neighbourhood of position 4,4 in B, of area 3 × 3 (yellow coloured). It is worth noting that the 1-neighbourhood of 4,4 does not exceed the size of B on any side ($(i + \epsilon = 4 + 1 \le 5) \wedge (i - \epsilon = 4 - 1 \ge 1) \wedge (j + \epsilon = 4 + 1 \le 5) \wedge (j - \epsilon = 4 - 1 \ge 1)$). Consequently, the reference point at position 4,4 is at the centre of the area (orange coloured). Since the size of $A^4_{4,4}$ ($d = 4$) is larger than 3, it cannot match at any position inside the 1-neighbourhood (first case). By contrast, the match of $A^3_{4,4}$, $A^2_{4,4}$, and $A^1_{4,4}$ can be verified inside the 1-neighbourhood of 4,4 in B, because for each considered sub-matrix $d \le 3$ (second case). It is assumed that α = 1.
By contrast, if we assume that $(i + \epsilon > \hat{M}) \vee (i - \epsilon < 1) \vee (j + \epsilon > \hat{M}) \vee (j - \epsilon < 1)$, the ϵ-neighbourhood of $i, j$ exceeds the size of B on one or more sides.

Example 4.3. Fig. 6 shows the square sub-matrices at position 4,4 in A (dotted lines and different colours) and the 2-neighbourhood of position 4,4 in B (yellow coloured). It is worth noting that the 2-neighbourhood has nominal area 5 × 5 and exceeds the size of B on the right ($j + \epsilon = 4 + 2 > 5$) and on the bottom ($i + \epsilon = 4 + 2 > 5$). Consequently, the reference point at position 4,4 is not at the centre of the area (orange coloured). Also, the match of all square sub-matrices can be verified inside the 2-neighbourhood of 4,4 in B, because for each of them $d < 5$ (second case). It is assumed that α = 1.

An important result is given by Theorem 4.1, which proves that, regardless of the ϵ value, E-ACSM is able to capture all the common patterns in the two matrices.
Fig. 8. Sub-matrices $A^3_{3,3}$ (dotted line) and $A^4_{4,4}$ (continuous line). The 1-neighbourhoods from 4,4 and 3,3 in B are yellow and green coloured, respectively. A portion which is not included in the 1-neighbourhood from 4,4 in B is green coloured with lines.
Fig. 6. Square sub-matrices at position 4,4 in A and the 2-neighbourhood from position 4,4 inside B. Sub-matrices which can be matched inside the 2-neighbourhood are yellow coloured.
Fig. 9. The largest square sub-matrix at position 4,4 in A matching inside the 3-neighbourhood from 4,4 in B is $A^4_{4,4}$ of maximal area 16. The 3-neighbourhood is yellow coloured.
Fig. 7. Sub-matrices $A^4_{4,4}$ (continuous line) and $A^3_{4,4}$ (dotted line). The 1-neighbourhood from 4,4 in B is yellow coloured.
Theorem 4.1 (Optimality of E-ACSM). E-ACSM optimally verifies the match of the square sub-matrices for the similarity computation, regardless of the ϵ value and of the kind of matching strategy.

Proof. Let $A^d_{i,j}$ be a square sub-matrix such that $d > (2\epsilon + 1)$ (first case, where $A^d_{i,j}$ cannot have any match inside the ϵ-neighbourhood of $i, j$ in B). However, $A^d_{i,j}$ may match outside the ϵ-neighbourhood of $i, j$ in B. Also, the sub-matrix $A^{d'}_{i,j}$ with $d' = (2\epsilon + 1)$ can match inside the ϵ-neighbourhood of $i, j$ in B (second case). If $A^d_{i,j}$ matches somewhere in B, all of its square sub-matrices match in B as well. These also cover the portions which were not included in the ϵ-neighbourhood. But each of these square sub-matrices has been previously verified, or shall be verified later, for matching inside its corresponding ϵ-neighbourhood, because of the sequential scan of A. It is easy to see that this argument holds regardless of the ϵ value and of the adopted matching criterion. □

Example 4.4. Fig. 7 shows that the match of the square sub-matrix $A^4_{4,4}$ can be verified outside the 1-neighbourhood of 4,4 in B but not inside the 1-neighbourhood. By contrast, the match of the square sub-matrix $A^3_{4,4}$ can be totally verified in the 1-neighbourhood. Fig. 8 shows the 1-neighbourhood of 4,4 in B (yellow coloured) and a portion outside the 1-neighbourhood which is not considered for the match (green coloured with lines). However, it is observable that the match of this portion has been previously verified in the 1-neighbourhood of 3,3 in B (green coloured), when the square sub-matrix $A^3_{3,3}$ was considered in A (dotted line).

Matching the square sub-matrices inside the ϵ-neighbourhood has an impact on the bounds of the image similarity measure. Let α be equal to 1. The following theorem holds for the exact match.
Fig. 10. The largest square sub-matrix at position 4,4 in A matching inside the 1-neighbourhood from 4,4 in B is $A^2_{4,4}$ of area 4. The 1-neighbourhood is yellow coloured.
Theorem 4.2 (Bounds of E-ACSM). E-ACSM computed between two matrices A and B is bounded between 0 (total mismatch) and $\gamma = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \beta$ (total match), where M is the size of A and β is defined as follows:

$$\beta = \begin{cases} \min\{i, j\}^2, & \text{if } \min\{i, j\} \le (\epsilon + 1) \\ (\epsilon + 1)^2, & \text{otherwise.} \end{cases} \tag{3}$$
Proof. Let A and B be identical matrices. The largest square sub-matrix at position $i, j$ in A matching inside the ϵ-neighbourhood of $i, j$ in B is that of maximal area $\min\{i, j\}^2$ only if the sub-matrix can be entirely included in the ϵ-neighbourhood. This can only occur when $\min\{i, j\} \le (\epsilon + 1)$. Otherwise, the area of the largest square sub-matrix at position $i, j$ in A is bounded by the ϵ value, since the portions outside the ϵ-neighbourhood cannot be considered in the match. Since A and B are identical, this holds for any adopted matching criterion. □

Example 4.5. Fig. 9 shows that the largest square sub-matrix at position 4,4 in A matching inside the 3-neighbourhood of 4,4 in B is $A^4_{4,4}$ of maximal area $\min\{4, 4\}^2 = 16$, since it can be included in the 3-neighbourhood ($\min\{4, 4\} = (3 + 1) = 4$). Fig. 10 shows that the largest square sub-matrix at position 4,4 in A matching inside the 1-neighbourhood of 4,4 in B is $A^2_{4,4}$ of area $d \times d = (\epsilon + 1)^2 = 4$, since it is bounded by ϵ = 1 ($\min\{4, 4\} > (1 + 1) = 2$).
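The upper bound γ of Theorem 4.2 is easy to evaluate numerically; the following short sketch (illustrative, with α = 1 as assumed above) sums the per-position contribution β of Eq. (3):

```python
def gamma_bound(M, eps):
    """Upper bound of E-ACSM between two identical M x M matrices
    (Theorem 4.2, with alpha = 1)."""
    total = 0
    for i in range(1, M + 1):
        for j in range(1, M + 1):
            m = min(i, j)
            beta = m * m if m <= eps + 1 else (eps + 1) ** 2   # Eq. (3)
            total += beta
    return total / (M * M)

# With eps + 1 >= M no sub-matrix is clipped by the neighbourhood, so the
# bound equals the value obtained when matching within the whole matrix.
print(gamma_bound(5, 1))   # small neighbourhood: smaller bound
print(gamma_bound(5, 4))   # whole matrix reachable: larger bound
```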
4.1.2. Approximate threshold-based pattern matching

An approximate approach is proposed for verifying the match of two square sub-matrices in the ϵ-neighbourhood. Specifically, let $A^d_{i,j}$, $B^d_{h,k}$ be two square sub-matrices belonging to the matrices A and B, respectively. A negative match between $A^d_{i,j}$ and $B^d_{h,k}$ is obtained when their similarity value is below an established threshold:

$$S(A^d_{i,j}, B^d_{h,k}) < \tau, \tag{4}$$

where $0 \le \tau \le 1$ is the threshold parameter and S can be defined as follows:

$$S(A^d_{i,j}, B^d_{h,k}) = 1 - \hat{D}(A^d_{i,j}, B^d_{h,k}), \tag{5}$$

where $\hat{D}: \mathcal{X} \times \mathcal{X} \rightarrow [0, 1]$ is a distance function over the elements $\mathcal{X}$ of the space of the square sub-matrices. Consequently, S ranges in [0, 1], where 0 represents the total mismatch, while 1 is the total match.

Pattern frequency by 2D normalized Hamming distance. The distance function $\hat{D}$ which resembles the notion of approximate two-dimensional pattern matching is a two-dimensional version of the Hamming distance, normalized by the area of the sub-matrices. In information theory, the Hamming distance between a pair of strings of the same length denotes the number of elements differing at the same positions in the two strings [52]. Also, it is known that the Hamming distance is a metric on the set of strings of a given length (Hamming space), because it satisfies the properties of: (i) non-negativity, (ii) symmetry, and (iii) triangle inequality. The definition of Hamming distance between two strings was extended in [53], which presented the Hamming distance between two matrices X and Y of the same area N × M (where N can be equal to M) as the number of elements differing at the same positions in the two matrices.

Definition 4.1 (2D Hamming Distance). The Hamming distance between two matrices X and Y, both of area N × M, is defined as follows:

$$D(X, Y) = \sum_{i=1}^{N} \sum_{j=1}^{M} F(X(i, j), Y(i, j)), \tag{6}$$

where:

$$F(a, b) = \begin{cases} 1, & \text{if } a \ne b \\ 0, & \text{otherwise.} \end{cases} \tag{7}$$
It is shown that the 2D Hamming distance is still a metric on the set of matrices of a given area, because the three properties still hold.

Theorem 4.3. The 2D Hamming distance is a metric on the set of the matrices $\mathcal{X}^{(N \times M)}$ of a given area N × M. Accordingly, it holds that:

1. $(D(X, Y) \ge 0) \wedge (D(X, Y) = 0 \Leftrightarrow X = Y)$ (non-negativity and identity),
2. $D(X, Y) = D(Y, X)$ (symmetry),
3. $D(X, Z) \le D(X, Y) + D(Y, Z)$, $\forall X, Y, Z \in \mathcal{X}^{(N \times M)}$ (triangle inequality).

Proof.

1. $D(X, Y) = 0$ if and only if X and Y do not differ at any corresponding position. But this happens if and only if X = Y.
2. The number of positions where X differs from Y is the same as the number of positions where Y differs from X.
3. $D(X, Y)$ is the number of positions where X differs from Y. It corresponds to the number of substitutions needed to transform X into Y; the same holds for $D(Y, Z)$. Hence, $D(X, Y) + D(Y, Z)$ substitutions suffice to transform X into Z (passing through Y). Consequently, it holds that $D(X, Y) + D(Y, Z) \ge D(X, Z)$. □

Fig. 11. Sub-matrices $A^3_{4,4}$ in A and $B^3_{4,5}$ (green lines) inside the 2-neighbourhood from 4,4 in B (yellow coloured).

At the end, the 2D Hamming distance of Eq. (6) is normalized over the area of the matrices, in order to weight the differences according to the area. The normalized 2D Hamming distance $\hat{D}$ ranges between 0 (total match) and 1 (total mismatch):

$$\hat{D}(X, Y) = \frac{D(X, Y)}{N \times M}. \tag{8}$$
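Eqs. (6)–(8) translate directly into code; the following minimal Python sketch (illustrative names) computes both distances for a toy 3 × 3 pair differing in 4 of its 9 positions, so that D = 4 and D̂ = 4/9 ≈ 0.44:

```python
import numpy as np

def hamming_2d(X, Y):
    """2D Hamming distance of Eq. (6): number of differing positions."""
    assert X.shape == Y.shape
    return int(np.count_nonzero(X != Y))

def hamming_2d_normalized(X, Y):
    """Normalized 2D Hamming distance of Eq. (8), ranging in [0, 1]."""
    return hamming_2d(X, Y) / X.size

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = np.array([[1, 0, 3], [0, 5, 6], [7, 0, 0]])
print(hamming_2d(X, Y))              # 4
print(hamming_2d_normalized(X, Y))   # 0.4444...
```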
Example 4.6. Fig. 11 shows the sub-matrices $A^3_{4,4}$ in A and $B^3_{4,5}$ inside the 2-neighbourhood of 4,4 in B. The differences are grey coloured. The Hamming distance $D(A^3_{4,4}, B^3_{4,5})$, corresponding to the number of differences, is equal to 4. Consequently, the normalized 2D Hamming distance $\hat{D}(A^3_{4,4}, B^3_{4,5})$ is computed as $\frac{4}{3 \times 3} = 0.44$.

In the following, it is proved that the notion of similarity as defined in Eq. (4) fully complies with the Axiomatic Condition 1.

Theorem 4.4. The notion of similarity S fully complies with the Axiomatic Condition 1.

Proof. Let X and Y be two square matrices of area N × N. Resorting to the Axiomatic Condition 1, $p_1$ is the fraction of positions $i, j$ in X such that there exists an element of the feature set $S(N^2, Y)$ characterizing Y which is a function of the sub-matrix $X^d_{i,j}$. Let d be equal to 1. Also, let $Y^1_{i,j}$, $i, j = 1, \ldots, N$ be the elements of the feature set $S(N^2, Y)$. Then, $p_1$ is the fraction of positions $i, j$ in X such that there exists an element $Y^1_{i,j}$ which is a function of the element $X^1_{i,j}$. This corresponds to $S(X, Y) = 1 - \hat{D}(X, Y)$, which is defined as the fraction of positions at which X and Y hold the same value. Since a position $i, j$ in X is associated with a single element $Y^1_{i,j}$, it follows that $p_2 = 1$. Also, the $p_0$ parameter corresponds to the threshold parameter τ. □

Example 4.7. Fig. 12 shows a sample feature set $S(N^2, Y)$ with the different elements $Y^1_{i,j}$, $i, j = 1, \ldots, 4$, and the correspondence between the element $Y^1_{2,3}$ at position 2,3 and the element $X^1_{2,3}$ at the same position 2,3.
Fig. 12. Correspondence between the element $Y^1_{2,3}$ at position 2,3, and the element $X^1_{2,3}$ at the same position 2,3.
4.2. Computational complexity

The reduction of the matching space to the ϵ-neighbourhood and the approximate pattern matching determine the following result on the time complexity of Algorithm 1 (E-ACSM).

Theorem 4.5. Given two matrices A and B, both of size M × M, the algorithm for computing E-ACSM between A and B takes $O(M^2 \epsilon^3 k)$ time. It can be further reduced to $O(M^2 \log_2 \epsilon \cdot \epsilon^2 k)$ by using a binary search on A.

Proof. The matching based on the notion of τ-thresholded similarity as defined in Eq. (5) fully complies with the problem of approximate two-dimensional pattern matching with at most k differences [54,55]. Searching for a square sub-matrix $A^d_{i,j}$ of A with at most k differences inside B can be solved in $O(M^2 k)$ time, which is linear in the area of the matrix B [55]. Algorithm 1 scans the matrix A position by position, for $M^2$ positions. In the worst case, for each position $i, j$, it verifies the match of $2\epsilon + 1$ square sub-matrices inside the ϵ-neighbourhood of $i, j$ in B (in the worst case, the ϵ-neighbourhood has area $(2\epsilon + 1) \times (2\epsilon + 1)$). Since verifying an approximate match inside the ϵ-neighbourhood takes $O(\epsilon^2 k)$ time, a brute-force approach takes $O(M^2 \epsilon^3 k)$ time. By using a binary search strategy on d for the selection of the square sub-matrix (see Algorithm 1) inside the matrix A [18], the algorithm takes $O(M^2 \log_2 \epsilon \cdot \epsilon^2 k)$ time. □

Considering that ϵ ≪ M and k is much smaller than the area of the sub-matrix, Algorithm 1 is faster than ACSM, which takes $O(M^5)$ if the two matrices A and B are of the same size [18]. A similar reduction is obtained when the binary search is used, considering that ACSM takes $O(M^4 \log_2 M)$ if the two matrices A and B are of the same size [18].

4.3. Example of application

In the following, a sample E-ACSM computation is shown. Specifically, Fig. 13 reports the largest square sub-matrices of area greater than or equal to 4 (α = 4) at the different positions $i, j$ in the matrix A matching some square sub-matrix inside the 2-neighbourhood (ϵ = 2) of $i, j$ in B. The value of the threshold parameter τ for the similarity computation is set to 0.80. It can be observed that some positions in A are not associated with any square sub-matrix. For example, position 3,5 in A determines the two square sub-matrices $A^3_{3,5}$ and $A^2_{3,5}$. The smallest square sub-matrix $A^1_{3,5}$ is not considered because its area is less than α. It can be observed that $A^3_{3,5}$ does not match inside the 2-neighbourhood with reference point 3,5 in B, because its similarity with the candidate square sub-matrices is always lower than 0.80. The same holds for $A^2_{3,5}$.
Fig. 13. The largest common square sub-matrices between A and B.
At the end, the E-ACSM similarity measure is computed as the average area of the extracted largest common square sub-matrices:

$$S_4(A, B) = \frac{9+9+4+4+4+4+4+4+4}{25} = \frac{46}{25} = 1.84. \tag{9}$$
5. Datasets and performance evaluation

In the following, a description of the datasets is provided, together with the measures used for evaluating the approach, the competing methods and the performance analysis for the validation of the proposed similarity measure. All the experiments have been performed in Matlab R2017a, also using the VLFeat library [56] v. 0.9.21, and in Eclipse IDE for Java Developers v. 1.5.1 with Java SE 1.6.

5.1. Datasets description

In order to evaluate the proposed approach for similarity in CBIR, experiments are performed on three benchmark datasets in the context of biometric systems, handwritten character recognition and object recognition.

5.1.1. IIT Delhi Touchless Palmprint Dataset

The first adopted dataset is the IIT Delhi Touchless Palmprint Dataset (Version 1.0) for hand biometrics.¹ All the images are acquired in an indoor environment using a circular fluorescent illumination around the camera lens. The dataset is composed of images from 235 human subjects aged between 12 and 57 years. Images from each subject, acquired under different hand pose variations in bitmap format, represent the left and right hands. The size of each image is 800 × 600 pixels. Starting from the original image dataset, a total of 90 images of 12 subjects is randomly selected. Each image is automatically cropped to 128 × 128 pixels and normalized. At the end, each image is processed in order to extract the hand contours together with the hand inner lines. Hence, the hand contour images represent the input to the retrieval methods. Fig. 14 shows a sample of hand images extracted from the IIT Delhi Touchless Palmprint dataset.

¹ The original image dataset is available at: http://www4.comp.polyu.edu.hk/~csajaykr/IITD/Database_Palm.htm.
Fig. 14. Sample of hand images from the IIT Delhi Touchless Palmprint Dataset.
Fig. 15. Sample of alphabet letters images from the NIST Special 19 dataset.
5.1.2. NIST Special 19 dataset

The second dataset used for the experiment is the NIST Special 19 Dataset.² NIST is characterized by binary images of 3669 handwriting sample forms and 814,255 digit, uppercase and lowercase alphabet letter images segmented from those forms. Each segmented image has an area of 128 × 128 pixels. All images are divided into different classes for multiple pattern recognition tasks: (i) by writer, (ii) by class, (iii) by case-less class (no distinction between uppercase and lowercase letters), and (iv) by field origin. For the experiment, a subset of 217 images with 7 alphabet letters is randomly selected. Fig. 15 shows sample images from the NIST Special 19 dataset.

5.1.3. Caltech-256 object category dataset

The third dataset used for the experiment is the Caltech-256 Object Category Dataset.³ From the original dataset, a subset of 203 images with 10 different objects is randomly selected. Each image is resized to 128 × 128. Fig. 16 shows sample images from the Caltech-256 Dataset.

5.2. Evaluation measures

Let $I_D$ be a dataset of images belonging to multiple semantic classes. Also, let Q be a query image which can be classified into one of these semantic classes. The experiment consists of retrieving the top K most (least) (dis)similar images to Q from the image dataset $I_D$. Specifically, the (dis)similarity between Q and each image $i \in I_D$ is computed. Then, the top K most (least) (dis)similar images to Q are retrieved, which are considered as relevant if they have the same semantic class as Q. From the K retrieved images, four performance measures are computed for evaluating the effectiveness of the (dis)similarity measure in CBIR: (i) precision, (ii) recall, (iii) accuracy, and (iv) Mean Average Precision (MAP) [24].

5.2.1. Precision

The precision is widely used in CBIR for the evaluation of different (dis)similarity measures [18,20,39]. It is defined as the fraction of the number of relevant retrieved images over the number of retrieved images for a query image Q:

$$P = \frac{\#\text{relevant retrieved images}}{\#\text{retrieved images}}. \tag{10}$$
² The original image dataset is available at: https://www.nist.gov/srd/nist-special-database-19.
³ The original image dataset is available at: http://www.vision.caltech.edu/Image_Datasets/Caltech256/.
5.2.2. Recall

The recall is defined as the fraction of the number of relevant retrieved images for a query image Q over the number of images in the class of Q:

$$R = \frac{\#\text{relevant retrieved images}}{\#\text{relevant images}}. \tag{11}$$
5.2.3. Accuracy

The accuracy is defined as the fraction of the total number of relevant retrieved images over the number of images in the dataset:

$$Acc = \frac{\sum_{Q \in \mathcal{Q}} A(Q)}{|I_D|}, \tag{12}$$
where $\mathcal{Q}$ is the set of test images (query images), and $A(Q)$ is the number of relevant retrieved images for the query image Q.

5.2.4. Mean average precision

From the values of precision and recall, the precision vs. recall curve (P–R curve) represents the trend of the precision according to the recall. A better P–R curve is farther from the origin of the axes, and its area can be approximated by the MAP value:

$$MAP = \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} AP(Q), \tag{13}$$

where $\mathcal{Q}$ is the set of test images (query images), and $AP(Q)$ is the average precision value for the top K retrieved images.
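The four evaluation measures are straightforward to compute; the sketch below (illustrative helper names, counts supplied by the retrieval loop) mirrors Eqs. (10)–(13):

```python
def precision(n_relevant_retrieved, n_retrieved):
    """Eq. (10)."""
    return n_relevant_retrieved / n_retrieved

def recall(n_relevant_retrieved, n_relevant_in_class):
    """Eq. (11)."""
    return n_relevant_retrieved / n_relevant_in_class

def accuracy(relevant_per_query, dataset_size):
    """Eq. (12): total relevant retrieved images over the dataset size."""
    return sum(relevant_per_query) / dataset_size

def mean_average_precision(ap_per_query):
    """Eq. (13): mean of the per-query average precision values."""
    return sum(ap_per_query) / len(ap_per_query)

# Top K = 5 retrieval returning 4 relevant images, with 7 relevant in the class:
print(precision(4, 5))   # 0.8
print(recall(4, 7))      # 0.571...
```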
5.3. Competing methods

The proposed method is compared with two similar template matching methods: (i) the baseline ACSM method [18,19], and (ii) the A-ACSM method [20], which is the approximate variant of ACSM. Furthermore, the proposed method is compared with the two well-known SIFT and SURF methods, which are widely used in CBIR for hand image, palm vein and fingerprint recognition [57–59], and for handwritten character and object recognition [60,61].

5.3.1. ACSM method

The baseline ACSM method computes the similarity between two images as the average area of the largest common square sub-matrices exactly matching in the two images. Accordingly, for each position in the first image, the method finds the largest square sub-matrix, of area greater than or equal to α, starting at that position and exactly matching within the entire second image.
Fig. 16. Sample of object images from the Caltech-256 Dataset.
The exact match of two square sub-matrices implies that all pixels need to be identical at the corresponding positions in order to obtain a positive match.

5.3.2. A-ACSM method

The A-ACSM method computes the similarity between two images as the average area of the largest common square sub-matrices approximately matching in the two images. The approximation consists of omitting a portion of pixels at regular intervals along the rows and columns of the square sub-matrices during the matching process. Accordingly, two parameters of row and column offset, ∆r and ∆c, set the inter-pixel distance to be considered for the match. Hence, those pixels falling closer than the inter-pixel distance are omitted from the match.

5.3.3. SIFT and SURF methods

Two different approaches are employed for comparing the SIFT and SURF descriptors. Let A and B be two images to compare and $p_x$ and $p_y$ their corresponding local descriptors (SIFT descriptors or SURF descriptors). The dissimilarity between A and B is computed using the 1-NN Similarity Average with Euclidean distance [37,62]. Accordingly, the dissimilarity between A and B is computed as the average Euclidean distance between the local descriptors in A and their nearest neighbour descriptors in B:

$$d_1(A, B) = \frac{1}{|A|} \sum_{p_x \in A} \min_{p_y \in B}\left(d(p_x, p_y)\right), \tag{14}$$

where $|A|$ is the number of local descriptors of A, and d is the Euclidean distance between $p_x$ and $p_y$.
Table 1
Comparison of average precision (%) when using the IIT Delhi Touchless Palmprint Dataset for the top K = 5 retrieval.

Class name   E-ACSM   A-ACSM   ACSM     SIFTavg   SIFTcount   SURFavg   SURFcount
Subject1     91.11    91.11    91.11    91.11     0.00        91.11     73.89
Subject2     100.00   100.00   100.00   100.00    0.00        100.00    93.89
Subject3     100.00   91.11    97.22    100.00    0.00        97.22     97.22
Subject4     100.00   100.00   100.00   100.00    0.00        97.22     97.22
Subject5     100.00   98.61    100.00   95.56     100.00      59.03     36.94
Mean         98.22    96.17    97.67    97.33     20.00       88.92     79.83
Table 2
Comparison of average recall (%) when using the IIT Delhi Touchless Palmprint Dataset for the top K = 5 retrieval.

Class name   E-ACSM   A-ACSM   ACSM    SIFTavg   SIFTcount   SURFavg   SURFcount
Subject1     42.86    42.86    42.86   42.86     0.00        42.86     33.33
Subject2     50.00    50.00    50.00   50.00     0.00        50.00     45.24
Subject3     50.00    42.86    47.62   50.00     0.00        47.62     47.62
Subject4     50.00    50.00    50.00   50.00     0.00        47.62     47.62
Subject5     50.00    48.81    50.00   46.43     50.00       28.57     16.67
Mean         48.57    46.90    48.10   47.86     10.00       43.33     38.10
Table 3
Comparison of accuracy (%) when using the IIT Delhi Touchless Palmprint Dataset.

Measure    E-ACSM   A-ACSM   ACSM    SIFTavg   SIFTcount   SURFavg   SURFcount
Accuracy   56.67    54.72    56.11   55.83     11.67       50.56     44.44
Also, the similarity between A and B is evaluated as the Percentage of Matches [37]. This is defined as the percentage of local descriptors in A with a correspondence in B:

$$s_m(A, B) = \frac{1}{|A|} \sum_{p_x \in A} m(p_x, B), \tag{15}$$

where $m(p_x, B)$ takes a value of 1 if $p_x$ has a correspondence in B, and 0 otherwise.
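Both set-level scores can be sketched in a few lines of Python. In the sketch below the descriptor sets are (n, dim) numpy arrays; note that the correspondence test m(·) of Eq. (15) is not fixed by the paper, so the sketch assumes Lowe's ratio test as one plausible choice:

```python
import numpy as np

def nn_average_distance(PA, PB):
    """Eq. (14): mean Euclidean distance from each descriptor of A to its
    nearest neighbour descriptor in B. PA, PB: (n, dim) arrays."""
    dists = np.linalg.norm(PA[:, None, :] - PB[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def percentage_of_matches(PA, PB, ratio=0.75):
    """Eq. (15) with an assumed correspondence test m(.): a descriptor of A
    matches in B if its nearest neighbour is closer than `ratio` times the
    second nearest neighbour (Lowe's ratio test)."""
    dists = np.linalg.norm(PA[:, None, :] - PB[None, :, :], axis=2)
    two_nn = np.sort(dists, axis=1)[:, :2]   # requires PB to hold >= 2 descriptors
    matches = two_nn[:, 0] < ratio * two_nn[:, 1]
    return float(matches.mean())
```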
5.4. Performance on IIT Delhi Touchless Palmprint Dataset

This section tests the performance of the proposed method in retrieving images belonging to the same human subject (semantic class) as the query image Q from the dataset of hand shape images, and compares it with the competing methods. To make a fair experiment, the dataset is randomly split into training and test sets. Specifically, 80% of the dataset is used as the training set, whereas the other 20% is used as the test set (query images). The mean average precision of E-ACSM is computed over the top K = 5 retrieval at different values of ϵ from 5 to 30 with intervals of 5, α equal to [4, 16, 36, 64, 100] and τ equal to [0.90, 0.80, 0.70]. As can be seen in Fig. 17, the best performance is obtained when τ is set between 0.80 and 0.90, specifically for τ equal to 0.90 and ϵ between 15 and 30, regardless of the α value. For τ
equal to 0.80, the best performance is obtained for ϵ between 25 and 30. In the rest of the experiment, ϵ, α and τ are set to the best values of 30, 4 and 0.90, respectively. Accordingly, α is set to 4 in ACSM and A-ACSM. Also, ∆r and ∆c of A-ACSM are set to 1 and 4 respectively, which is the best setting found in [20]. Tables 1 and 2 show the average precision and recall (%) for each semantic class of the test set over the top K = 5 retrieval. It is worth noting that E-ACSM obtains the highest precision for all semantic classes, with a mean gain over the competing methods between 0.5% (ACSM) and 78.2% (SIFTcount). In terms of recall, E-ACSM obtains the highest value for the different semantic classes, which is on average between 0.6% (ACSM) and 38.7% (SIFTcount) higher than the competing methods. Table 3 shows the accuracy values (%). It is worth noting that the highest performance (56.67%) is obtained by E-ACSM, which is followed by ACSM (56.11%), SIFTavg (55.83%), and the other competing methods. The P–R curve is depicted in Fig. 18. It shows that the largest area below the curve is obtained by E-ACSM, which is followed by ACSM, SIFTavg and A-ACSM. By contrast, the other methods perform considerably worse, with lower precision vs. recall values.

5.5. Performance on NIST Special 19 dataset

This section tests the performance of the proposed method in retrieving images corresponding to the same letter of the alphabet (semantic class) as the query image Q from the dataset
Fig. 17. Mean average precision as a function of ϵ and α, for τ equal to 0.90, 0.80, and 0.70, for the top K = 5 retrieval.

Fig. 18. Curve of precision–recall obtained using the IIT Delhi Touchless Palmprint Dataset.

Table 4
Comparison of the mean average precision, average recall and accuracy (%) using the NIST dataset for the top K = 25 retrieval.

Measure    E-ACSM   A-ACSM   ACSM    SIFTavg   SIFTcount   SURFavg   SURFcount
MAP        46.18    47.26    75.45   38.51     21.94       44.60     38.10
Recall     20.27    19.20    31.07   14.31     9.78        17.69     13.24
Accuracy   24.11    21.89    35.19   16.63     11.85       19.00     14.56

Table 5
Comparison of the mean average precision, average recall and accuracy (%) using the Caltech-256 dataset for the top K = 20 retrieval.

Measure    E-ACSM   A-ACSM   ACSM    SIFTavg   SIFTcount   SURFavg   SURFcount
MAP        56.70    43.64    53.15   32.40     25.64       34.01     21.05
Recall     14.83    12.67    14.00   8.50      7.50        9.00      5.83
Accuracy   25.75    23.08    23.58   12.08     12.17       16.92     11.92
Accordingly, only the categorization of the images by class is considered, which makes the retrieval task more complex because of the different writing styles and field information of the involved images. The dataset is randomly partitioned into a training set (80% of the images) and a test set (20%, used as query images). The ϵ, α and τ parameters are set to the values achieving the best performance, which are 5, 16 and 0.90, respectively. Accordingly, α is set to 16 in ACSM and A-ACSM, while ∆r and ∆c of A-ACSM are set to 1 and 4, respectively. Table 4 shows the MAP, average recall and accuracy (%) for the different methods. To make the retrieval more difficult, the measures are computed over the top K = 25 retrieval. It is worth noting that ACSM is competitive, with a gain over ϵ-ACSM of 29.3% in MAP, 10.8% in recall and 11.1% in accuracy. This is an expected result, since ACSM is the equivalent of ϵ-ACSM with no approximation in the similarity computation. However, the better performance comes at the expense of a much higher computational complexity, which can be prohibitive for larger images (see Section 4.2). Fig. 19 shows the results of image retrieval (the top K = 7 retrieval) for a query image belonging to the semantic class J. The ϵ-ACSM similarity value is reported at the bottom of each image. As can be observed, all retrieved images are relevant, except image (h), which belongs to the semantic class O. Accordingly, a very high retrieval performance can be expected.

Fig. 19. Results of image retrieval for the semantic class J.
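The retrieval behind such an example is a plain ranking by similarity, sketched below; `similarity` is an assumed callable standing in for the ϵ-ACSM measure, and all names are illustrative.

```python
def retrieve_top_k(query, corpus, similarity, k=7):
    """Return the k corpus images most similar to the query, with scores.

    similarity is an assumed callable (here it would stand in for the
    eps-ACSM measure); corpus is any iterable of images.
    """
    scored = [(similarity(query, img), img) for img in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```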
5.6. Performance on Caltech-256 object category dataset

Two random partitions of the dataset into a training set (80% of the images) and a test set (20%, used as query images) are considered. The ϵ, α and τ parameters are set to the values obtaining the best result, which are 10, 4 and 0.90, respectively. Accordingly, α is set to 4 in ACSM and A-ACSM, while ∆r and ∆c of A-ACSM are set to 1 and 4, respectively. Table 5 reports the MAP, average recall and accuracy (%) for the different methods. To make the retrieval more difficult, the performance is averaged over the top K = 20 retrieval. It is worth noting that ϵ-ACSM overcomes all the competing methods, with a gain between 3.6% (ACSM) and 35.6% (SURFcount) in MAP, between 0.83% (ACSM) and 9% (SURFcount) in recall, and between 2.2% (ACSM) and 13.8% (SURFcount) in accuracy.

6. Discussion

The experiment shows that a higher similarity threshold τ improves the performance of ϵ-ACSM: requiring a closer similarity between the square sub-matrices yields better results. However, requiring the maximum similarity (exact match) between the square sub-matrices, as in ACSM, may decrease the retrieval performance (see the IIT Delhi Touchless Palmprint Dataset and the Caltech-256 dataset). Also, ϵ-ACSM performs better than the previously introduced approximate variant A-ACSM. This proves the efficacy of introducing a threshold-based approximation in computing the similarity, as well as of limiting the match of the square sub-matrices to a neighbourhood of the current position. It also provides a better computational complexity than an exact matching strategy, which can be quite demanding for larger images.
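To make the discussed mechanism concrete, the following schematic sketch performs a threshold-based patch match restricted to an ϵ-neighbourhood, in the spirit of ϵ-ACSM; it deliberately simplifies the actual measure (which averages the area of the matched square sub-matrices), and the function name and looping strategy are illustrative assumptions.

```python
import numpy as np

def matched_patch_fraction(A, B, alpha=4, eps=30, tau=0.90):
    """Schematic threshold-based patch matching in a restricted neighbourhood.

    A, B: 2D grayscale arrays; alpha: patch area, so patches are s x s
    with s = sqrt(alpha); eps: search radius in B around the patch's own
    position in A; tau: minimum fraction of equal pixels (i.e.,
    1 - normalized 2D Hamming distance) to accept a match.
    """
    s = int(round(np.sqrt(alpha)))
    matched = total = 0
    for i in range(0, A.shape[0] - s + 1, s):
        for j in range(0, A.shape[1] - s + 1, s):
            patch = A[i:i + s, j:j + s]
            total += 1
            found = False
            # candidate positions in B within the eps-neighbourhood of (i, j)
            for r in range(max(0, i - eps), min(B.shape[0] - s, i + eps) + 1):
                for c in range(max(0, j - eps), min(B.shape[1] - s, j + eps) + 1):
                    if np.mean(patch == B[r:r + s, c:c + s]) >= tau:
                        found = True  # fraction of equal pixels reaches tau
                        break
                if found:
                    break
            matched += found
    return matched / max(total, 1)
```

Accepting a candidate as soon as the fraction of equal pixels reaches τ bounds the per-patch search to the at most (2ϵ + 1)² positions of the neighbourhood rather than the whole image, which is the source of the complexity advantage discussed above.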
7. Conclusions and future directions

In this paper, a new approach exploring two axiomatic conditions for the similarity of images was proposed. The first axiomatic condition resembles the lower-level notion of frequency in approximately matching image patches, which is a brand-new idea.
The second axiomatic condition defines the higher-level notion of comparison between images, which is formalized as the average area of the common patches matching within the images. Then, ϵ-ACSM was introduced, a variant of the baseline ACSM similarity measure conforming to the proposed theoretical approach. Its main characteristics were presented and its main properties were explored. Accordingly, it was shown that ϵ-ACSM overcomes the limitations of the previous ACSM-like similarity measures in terms of computational complexity, similarity evaluation and CBIR. An experiment was conducted to test the proposed approach on benchmark datasets. The obtained results showed that ϵ-ACSM overcomes other competing methods specifically adopted for image retrieval, revealing its potential in different domains. As future work, the axiomatic approach will be tested in different biometric contexts, including fingerprint recognition.

Compliance with ethical standards

Declaration of competing interest

The author declares that there is no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by the author.

References

[1] W. Zhou, H. Li, Q. Tian, Recent advance in content-based image retrieval: A literature survey, CoRR abs/1706.06064, http://arxiv.org/abs/1706.06064, 2017.
[2] K. Iqbal, M.O. Odetayo, A. James, Content-based image retrieval approach for biometric security using colour, texture and shape features controlled by fuzzy heuristics, J. Comput. System Sci. 78 (4) (2012) 1258–1277.
[3] R. Zhou, D. Zhong, J. Han, Fingerprint identification using SIFT-based minutia descriptors and improved all descriptor-pair matching, Sensors 13 (3) (2013) 3142–3156.
[4] H. Müller, N. Michoux, D. Bandon, A. Geissbuhler, A review of content-based image retrieval systems in medical applications-clinical benefits and future directions, Int. J. Med. Inform. 73 (1) (2004) 1–23.
[5] V.V. Estrela, A.E. Herrmann, Content-Based Image Retrieval (CBIR) in remote clinical diagnosis and healthcare, in: Encyclopedia of E-Health and Telemedicine, IGI Global, Hershey, PA, 2016, pp. 495–520.
[6] M. Molinier, J. Laaksonen, T. Hame, Detecting man-made structures and changes in satellite imagery with a content-based information retrieval system built on self-organizing maps, IEEE Trans. Geosci. Remote Sens. 45 (4) (2007) 861–874.
[7] N. Laban, M. ElSaban, A. Nasr, H. Onsi, System refinement for content based satellite image retrieval, Egypt. J. Remote Sens. Space Sci. 15 (1) (2012) 91–97.
[8] C. Wengert, M. Douze, H. Jégou, Bag-of-colors for improved image search, in: Proceedings of ACM International Conference on Multimedia, 2011, pp. 1437–1440.
[9] F. Wang, L. Kang, Y. Li, Sketch-based 3D shape retrieval using convolutional neural networks, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1875–1883.
[10] X.-Y. Wang, B.-B. Zhang, H.-Y. Yang, Content-based image retrieval by integrating color and texture features, Multimedia Tools Appl. 68 (3) (2014) 545–569.
[11] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of IEEE International Conference on Computer Vision, Vol. 2, 1999, pp. 1150–1157.
[12] H. Bay, T. Tuytelaars, L. Van Gool, SURF: Speeded up robust features, in: Proceedings of Computer Vision - ECCV, Vol. 3951, 2006, pp. 404–417.
[13] H. Liu, D. Song, S. Rüger, R. Hu, V. Uren, Comparing dissimilarity measures for content-based image retrieval, in: Proceedings of the Fourth Asia Information Retrieval Conference on Information Retrieval Technology, Harbin, China, 2008, pp. 44–50.
[14] J.P. Van de Geer, Some Aspects of Minkowski Distance, Leiden University, Department of Data Theory, 1995.
[15] A. Strehl, J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (2002) 583–617.
[16] W.K. Pratt, Digital Image Processing, Wiley, New York, 1991.
[17] C. Barnes, D. Goldman, E. Shechtman, A. Finkelstein, The PatchMatch randomized matching algorithm for image manipulation, Commun. ACM 54 (11) (2011) 103–110.
[18] A. Amelio, C. Pizzuti, Average Common Submatrix: A new image distance measure, Springer, Berlin, Heidelberg, 2013, pp. 170–180.
[19] A. Amelio, C. Pizzuti, A patch-based measure for image dissimilarity, Neurocomputing 171 (2016) 362–378.
[20] A. Amelio, Approximate matching in ACSM dissimilarity measure, Procedia Comput. Sci. 96 (2016) 1479–1488.
[21] A. Amelio, D. Brodić, The ϵ-Average Common Submatrix: Approximate searching in a restricted neighborhood, IWCIA (short comm.), arXiv:1706.06026, 2017.
[22] A. Nazir, R. Ashraf, T. Hamdani, N. Ali, Content based image retrieval system by using HSV color histogram, discrete wavelet transform and edge histogram descriptor, in: Proceedings of IEEE International Conference on Computing, Mathematics and Engineering Technologies (ICoMET), 2018, pp. 1–6.
[23] N. Ali, K.B. Bajwa, R. Sablatnig, Z. Mehmood, Image retrieval by addition of spatial information based on histograms of triangular regions, Comput. Electr. Eng. 54 (2016) 539–550.
[24] B. Zafar, R. Ashraf, N. Ali, M.K. Iqbal, M. Sajid, S.H. Dar, N.I. Ratyal, A novel discriminating and relative global spatial image representation with applications in CBIR, Appl. Sci. 8 (2018) 2242.
[25] B. Zafar, R. Ashraf, N. Ali, M. Ahmed, S. Jabbar, et al., Image classification by addition of spatial information based on histograms of orthogonal vectors, PLOS ONE 13 (6) (2018) e0198175.
[26] B. Zafar, R. Ashraf, N. Ali, M. Ahmed, S. Jabbar, K. Naseer, A. Ahmad, G. Jeon, Intelligent image classification based on spatial weighted histograms of concentric circles, Comput. Sci. Inf. Syst. 15 (3) (2018) 615–633.
[27] N. Ali, B. Zafar, F. Riaz, S. Hanif Dar, N. Iqbal Ratyal, K. Bashir Bajwa, M. Kashif Iqbal, M. Sajid, A hybrid geometric spatial image representation for scene classification, PLoS One 13 (9) (2018) e0203339.
[28] B. Wang, Z. Li, M. Li, W.-Y. Ma, Large-scale duplicate detection for web image search, in: Proceedings of IEEE International Conference on Multimedia and Expo, Toronto, ON, 2006, pp. 353–356.
[29] C. Siagian, L. Itti, Rapid biologically-inspired scene classification using features shared with visual attention, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2) (2007) 300–312.
[30] J. Wang, X.-S. Hua, Interactive image search by color map, ACM Trans. Intell. Syst. Technol. 3 (1) (2011) 12.
[31] N. Ali, K.B. Bajwa, R. Sablatnig, S.A. Chatzichristofis, Z. Iqbal, A novel image retrieval based on visual words integration of SIFT and SURF, PLoS One 11 (6) (2016) e0157428.
[32] M. Calonder, V. Lepetit, C. Strecha, P. Fua, BRIEF: Binary robust independent elementary features, in: Proceedings of European Conference on Computer Vision, 2010, pp. 778–792.
[33] S. Zhang, Q. Tian, Q. Huang, W. Gao, Y. Rui, USB: Ultra-short binary descriptor for fast visual matching and retrieval, IEEE Trans. Image Process. 23 (8) (2014) 3671–3683.
[34] H. Jégou, C. Schmid, H. Harzallah, J. Verbeek, Accurate image search using the contextual dissimilarity measure, IEEE Trans. Pattern Anal. Mach. Intell. 32 (1) (2010) 2–11.
[35] W. Kong, W.-J. Li, M. Guo, Manhattan hashing for large-scale image retrieval, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2012, pp. 45–54.
[36] X. Xu, Q. Zhang, Medical image retrieval using local binary patterns with image Euclidean distance, in: Proceedings of International Conference on Information Engineering and Computer Science, Wuhan, 2009, pp. 1–4.
[37] G. Amato, F. Falchi, C. Gennaro, Landmark recognition in VISITO Tuscany, in: International Workshop on Multimedia Cultural Heritage (MM4CH), 2011, pp. 1–13.
[38] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27 (3) (1948) 379–423.
[39] G.D. Tourassi, B. Harrawood, Evaluation of information-theoretic similarity measures for content-based retrieval and detection of masses in mammograms, Med. Phys. 34 (1) (2007) 140–150.
[40] J.M. Zachary Jr., An Information Theoretic Approach to Content-Based Image Retrieval (Ph.D. dissertation), Louisiana State University and Agricultural & Mechanical College, 2000, AAI9998723.
[41] M. Murayama, D. Oguro, H. Kikuchi, H. Huttunen, Y.S. Ho, J. Shin, Color-distribution similarity by information theoretic divergence for color images, in: Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Jeju, 2016, pp. 1–4.
[42] K. Pushpalatha, V.S. Ananthanarayana, An information theoretic similarity measure for unified multimedia document retrieval, in: Proceedings of 7th International Conference on Information and Automation for Sustainability, Colombo, 2014, pp. 1–6.
[43] M.M. Rahman, P. Bhattacharya, B.C. Desai, Statistical similarity measures in image retrieval systems with categorization & block based partition, in: Proceedings of IEEE International Workshop on Imaging Systems and Techniques, 2005, pp. 92–97.
[44] D. Zhang, G. Lu, Evaluation of similarity measurement for image retrieval, in: Proceedings of International Conference on Neural Networks and Signal Processing, Vol. 2, Nanjing, 2003, pp. 928–931.
[45] H. Cho, L. Hadjiiski, B. Sahiner, H.-P. Chan, M. Helvie, C. Paramagul, A.V. Nees, Similarity evaluation in a content-based image retrieval (CBIR) CADx system for characterization of breast masses on ultrasound images, Med. Phys. 38 (4) (2011) 1820–1831.
[46] M. Tzelepi, A. Tefas, Deep convolutional learning for content based image retrieval, Neurocomputing 275 (2018) 2467–2478.
[47] R.R. Saritha, V. Paul, P.G. Kumar, Cluster Comput. (2018) 1–19.
[48] A. Qayyum, S.M. Anwar, M. Awais, M. Majid, Medical image retrieval using deep convolutional neural network, Neurocomputing 266 (2017) 8–20.
[49] C. Bai, L. Huang, X. Pan, J. Zheng, S. Chen, Optimization of deep convolutional neural network for large scale image retrieval, Neurocomputing 303 (2018) 60–67.
[50] H. Liu, B. Li, X. Lv, Y. Huang, Image retrieval using fused deep convolutional features, Procedia Comput. Sci. 107 (2017) 749–754.
[51] J. Ziv, An axiomatic approach to the notion of similarity of individual sequences and their classification, in: First Int. Conf. on Data Compr. Comm. and Proc., 2011, pp. 3–7.
[52] R.W. Hamming, Error detecting and error correcting codes, Bell Syst. Tech. J. 29 (2) (1950) 147–160.
[53] K. Fredriksson, E. Ukkonen, Algorithms for 2-D Hamming distance under rotations, manuscript, 1999.
[54] K. Park, Analysis of two-dimensional approximate pattern matching algorithms, Theoret. Comput. Sci. 201 (1-2) (1998) 263–273.
[55] A. Amir, G.M. Landau, Fast parallel and serial multidimensional approximate array matching, Theoret. Comput. Sci. 81 (1) (1991) 97–115.
[56] A. Vedaldi, B. Fulkerson, VLFeat: An open and portable library of computer vision algorithms, http://www.vlfeat.org/, 2008.
[57] W. Kang, Y. Liu, Q. Wu, X. Yue, Contact-free palm-vein recognition based on local invariant features, PLoS One 9 (5) (2014) e97548.
[58] N. Charfi, H. Trichili, A.M. Alimi, B. Solaiman, Novel hand biometric system using invariant descriptors, in: 6th International Conference of Soft Computing and Pattern Recognition (SoCPaR), 2014, pp. 261–266.
[59] R. Zhou, S. Sin, D. Li, T. Isshiki, H. Kunieda, Adaptive SIFT-based algorithm for specific fingerprint verification, in: International Conference on Hand-Based Biometrics, 2011, pp. 1–6.
[60] L. Chergui, M. Kef, SIFT descriptors for Arabic handwriting recognition, Int. J. Comput. Vis. Robot. 5 (4) (2015) 441–461.
[61] M. Alex, S. Das, An approach towards Malayalam handwriting recognition using dissimilar classifiers, Procedia Technol. 25 (2016) 224–231.
[62] S. Hua, G. Chen, H. Wei, Q. Jiang, Similarity measure for image resizing using SIFT feature, EURASIP J. Image Video Process. 2012 (1) (2012) 1–11.