Saliency-driven image classification method based on histogram mining and image score


Author's Accepted Manuscript

Saliency-driven Image Classification Method Based on Histogram Mining and Image Score Baiying Lei, Ee-Leng Tan, Siping Chen, Dong Ni, Tianfu Wang

www.elsevier.com/locate/pr

PII: S0031-3203(15)00052-7
DOI: http://dx.doi.org/10.1016/j.patcog.2015.02.004
Reference: PR5343

To appear in: Pattern Recognition

Received date: 27 August 2013
Revised date: 27 December 2014
Accepted date: 7 February 2015

Cite this article as: Baiying Lei, Ee-Leng Tan, Siping Chen, Dong Ni, Tianfu Wang, Saliency-driven Image Classification Method Based on Histogram Mining and Image Score, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2015.02.004

Title: Saliency-driven Image Classification Method Based on Histogram Mining and Image Score

Author names and affiliations:

Baiying Lei a (a Department of Biomedical Engineering, School of Medicine, Shenzhen University, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging; Tel: 86-755-26534314; Fax: 86-755-26534940; e-mail: [email protected])

Ee-Leng Tan b (b School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue 50, S2B4a-03, Digital Signal Processing Laboratory, Singapore 639798; phone: (65) 6790 6900; fax: (65) 6792 0415; e-mail: [email protected])

Siping Chen a* (a Department of Biomedical Engineering, School of Medicine, Shenzhen University, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging; Tel: 86-755-26536718; Fax: 86-755-26534940; e-mail: [email protected])

Corresponding author: Dong Ni a (a Department of Biomedical Engineering, School of Medicine, Shenzhen University, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Nanhai Ave 3688, Shenzhen, Guangdong, P. R. China, 518060; Tel: 86-755-26534314; Fax: 86-755-26534940; e-mail: [email protected])

Corresponding author: Tianfu Wang a* (a Department of Biomedical Engineering, School of Medicine, Shenzhen University, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging; Tel: 86-755-26534314; Fax: 86-755-26534940; e-mail: [email protected])

Abstract

Since most image classification tasks involve discriminative information (i.e., saliency), this paper proposes a new bag-of-phrase (BoP) approach to incorporate this information. Specifically, a saliency map and local features are first extracted from edge-based dense descriptors. These features are represented by histograms and mined with a discriminative learning technique. An image score calculated from the saliency map is also investigated to optimize a support vector machine (SVM) classifier. Both feature map and kernel trick methods are explored to enhance the accuracy of the SVM classifier. In addition, novel inter- and intra-class histogram normalization methods are investigated to further boost the performance of the proposed method. Experiments using several publicly available benchmark datasets demonstrate that the proposed method achieves promising classification accuracy and superior performance over state-of-the-art methods.

Highlights

- A new saliency-driven bag-of-phrase approach for image classification is proposed.
- Edge-based dense descriptor is applied.
- Histogram mining and discriminative learning are investigated.
- Image score is adopted as latent information to optimize the linear classifier.
- Novel inter- and intra-class histogram normalization methods are explored.

Key Words: Image classification; Bag of Phrase; Saliency Map; Histogram Mining; Image Score.

Acknowledgement

This work was supported partly by the National Natural Science Foundation of China (Nos. 61402296, 61101026, 61372006, 81270707 and 61427806), the China Postdoctoral Science Foundation Funded Project (Nos. 2013M540663 and 2014T70824), the 48th Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the National Natural Science Foundation of Guangdong Province (No. S2013040014448), the Shenzhen Key Basic Research Project (No. JCYJ20130329105033277), and the Shenzhen-Hong Kong Innovation Circle Funding Program (No. JSE201109150013A).

Author Biography

Baiying Lei received her M.Eng degree in electronics science and technology from Zhejiang University, China in 2007, and her Ph.D. degree from Nanyang Technological University (NTU), Singapore in 2013. She is currently with the National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Medicine, Shenzhen University, P. R. China. Her current research interests include audio/image signal processing, digital watermarking and encryption, machine learning and pattern recognition.

Siping Chen received his Ph.D. degree in biomedical engineering from Xi'an Jiaotong University in 1987. He is currently a Professor in the Department of Biomedical Engineering, School of Medicine, Shenzhen University and Director of the National-Regional Key Technology Engineering Laboratory for Medical Ultrasound. His research interests include ultrasound imaging, digital signal processing and pattern recognition.

Ee-Leng Tan received his BEng (1st Class Hons) and Ph.D. degrees in Electrical and Electronic Engineering from Nanyang Technological University in 2003 and 2012, respectively. Currently, he is with NTU as a Research Fellow. His research interests include image/audio processing and real-time digital signal processing. Dong Ni received his Ph.D. degree in computer science and engineering from the Chinese University of Hong Kong in 2009. He is currently an Associate Professor in Department of Biomedical Engineering, School of Medicine, Shenzhen University. His research interests mainly include ultrasound image analysis, image guided surgery and pattern recognition.

Tianfu Wang received his Ph.D. degree in biomedical engineering from Sichuan University in 1997. He is currently a Professor in Department of Biomedical Engineering, School of Medicine, Shenzhen University and the Associate Chair of School of Medicine. His research interests include ultrasound image analysis, pattern recognition and medical imaging.

Saliency-driven Image Classification Method Based on Histogram Mining and Image Score

1. Introduction

In the past decade, image classification has become a popular research topic in the computer vision field. The bag-of-word (BoW) technique has been shown to be one of the most effective algorithms for image classification [1, 2]. As spatial contextual information such as word co-occurrence and pairwise information is ignored [3-5] by the BoW method, the visual words produced by the BoW method may not be illustrative enough, which limits the overall classification performance [6]. In addition, the BoW method is based on the naive Bayesian assumption that each feature is independent of the others, and this assumption may not always hold [3]. To address these limitations of the BoW method, both the bag-of-phrase (BoP) method, including its extensions [5], and the fusion of BoW and BoP algorithms [4] have been proposed to increase the descriptive and discriminative power of visual words. The BoP method integrates context information between images and incorporates visual saliency as side information [3-5, 7-10]. Visual saliency with high-quality segmentation masks [11, 12] is also utilized in the BoP method to localize

region-of-interest (RoI) of images for effective classification. Another increasingly popular method is the spatial pyramid matching (SPM) model [2], which generates hybrid image features and represents an image hierarchically by multi-scale partition. As reported in [8, 11, 13, 14], SPM-based algorithms not only achieved remarkable results but also delivered promising performance on numerous benchmark datasets. Therefore, a myriad of extensions have been motivated by the SPM model, and these extensions outperform the original SPM model even when using a single descriptor [7, 10, 12, 15]. However, numerous common patterns are generated by multi-scale partition, and the contextual redundancy between words (word occurrence) is quite high. On this note, the integration of saliency and mining techniques with the SPM method would probably improve performance, but it remains unconsidered in existing SPM-based approaches. In the literature, the keypoint-based SIFT descriptor has been commonly used for image classification and detection [1, 16]. Meanwhile, densely computed or sampled descriptors (i.e., dense SIFT (DSIFT), Phow [17], and DAISY [18, 19]) are found to be more promising in resolving image classification issues than the typical SIFT descriptor, especially when integrated with discriminative techniques [11, 18, 20]. Essentially, the DSIFT descriptor is a fast implementation of the SIFT descriptor using densely sampled keypoints of the same orientation and size, whereas the Phow descriptor is a Gaussian-smoothed implementation of DSIFT. The DAISY descriptor was introduced by Tola et al. [18] to improve classification accuracy using highly dense computed features. Compared to the conventional sparse SIFT method [21], there is no need to assume any probabilistic model of the input data when dense descriptors are used. Furthermore, dense descriptors are faster to compute than the traditional SIFT descriptor and can be reused for multiple images of the same size [17, 18].
These advantages of dense descriptors lead to lower computational cost compared to the SIFT descriptor. The support vector machine (SVM) has been one of the most popular and widely used classifiers in many image classification tasks [17, 22-24]. In addition, discriminative learning in the SVM classifier using the training data has been gaining popularity [11, 25]. Unlike the existing SVM framework, in our approach an image score generated from the saliency map is fed into the SVM classifier to increase discriminative power as well as to improve classification accuracy. This paper addresses the image classification issue using salient region localization, histogram mining, image score, and discriminative learning. The novelty and contributions of this paper are as follows. Firstly, multi-scale spatial pyramid based image representation is combined with a histogram mining technique, and edge-based dense descriptors are utilized to improve classification performance and reduce computation time. Secondly, a saliency map

is extracted to construct the BoP model and increase the discriminative ability of the features. Thirdly, an image score obtained from visual saliency is integrated into the separating hyperplane of the SVM classifier; this image score is treated as latent information to increase the discriminative power of the features. Finally, novel histogram normalization methods, which consider not only inter- and intra-class variations but also histogram statistical information, are proposed. The organization of this paper is as follows. Section 2 introduces the related work. Section 3 discusses the framework and methodologies applied in the proposed method. Experimental results and discussions based on various publicly available datasets are presented in Section 4. Finally, Section 5 summarizes this paper.

2. Related work

Recently, there has been significant interest in image classification [3-5, 7-10, 26], driven by the huge success achieved by the bag-of-feature (BoF) model and its extensions. One of the most popular BoF extensions is the BoW method, in which each image is represented by local features mapped into a codebook, and the occurrence count of features is often represented by a histogram [1-5]. However, conventional BoW methods usually ignore the spatial information of the image [27-29] and produce features that suffer from limited descriptive power and poor accuracy. To address this weakness, many alternative solutions have been proposed and discussed in the literature [2, 8, 10, 11, 13, 14]. One such approach is the BoP method, which integrates context information such as word co-occurrence to represent an image. By accumulating and concatenating the corresponding neighbouring words, the performance of the BoW method is further improved by investigating the difference vector expansion between word centers and local descriptors when pooling feature vectors rather than code words [26]. To date, in the BoW representation, saliency and visual contextual information (i.e., co-occurrence, relationships or correlations between intra- and inter-class, spatial, spectral, spatial-cepstral, and semantic information with class labels) has been extensively studied and explored for image classification [3, 7, 11-14, 20, 30]. For features represented by local descriptors, local features are more useful than global features in solving image classification problems [21, 27, 29, 31]. To extract local features, one of the most commonly used descriptors is the SIFT descriptor [1, 16], which is a sparse descriptor. On the other hand, dense descriptors such as DSIFT, Phow, DAISY, and enhanced SIFT descriptors based on regular or random dense sampling have become the state of the art and have outperformed the sparse SIFT descriptor in the recent literature [11, 17-19, 32].
On this note, a myriad of image

classification works have integrated visual words from dense descriptors and achieved remarkable results [11, 17, 19]. Specifically, dense descriptors tend to capture more shape or edge information than conventional methods [19, 32]. For example, in [19], the combination of the DAISY descriptor with an edge map not only produced good classification accuracy, but also reduced the feature computation time on the collected logo dataset. In addition, features such as textures, contours, and edges can be well represented using a saliency map to enhance the distinguishing power of the descriptor [19, 32]. Meanwhile, it is found that integrating the SPM approach with existing models such as BoF, BoW, and BoP leads to better classification results than those without the SPM approach. Therefore, numerous image classification methods based on the SPM model and its extensions have been proposed [1, 2, 7, 10, 12-15]. For example, Yang et al. [13] enhanced the SPM method using generalized vector quantization and sparse coding. Other extensions include pooling on multiple spatial scales with a maximum rule rather than average pooling of multi-scale spatial features. In [14], the SPM method was extended by the introduction of spatial co-occurrence to capture the relative arrangement of the code words. Recently, image classification performance has been further improved by some novel representation methods [31, 33-35]. For instance, a novel discriminative image representation (i.e., grouplet) [33] was adopted to mine common image patterns. In [31], Zhou et al. proposed an effective and accurate image classification algorithm based on super vector coding. Quattoni and Torralba [34] addressed the scene classification issue by modelling the scene layout using different prototypes to capture the features and arrangement of scene components. In [35], Perronnin and Dance applied the Fisher kernel to visual words to classify objects without certain assumptions (i.e., feature independence).
For the classification task, both generative and discriminative learning models are quite effective. A discriminative model such as SVM learns discriminative features and builds decision boundaries (or estimates posterior probabilities) to discriminate among classes, whereas a generative model such as GMM models class-conditional pdfs and prior probabilities to create a probability model via the underlying density distribution. As a discriminative model, the SVM classifier and its derivatives exhibit superior performance [17, 22]. The generative learning model is quite desirable for exploring the underlying image distribution, but it is generally complex in terms of computation [36]. On the other hand, discriminative methods [8, 11] require low computational cost when applied to labelled training data. There has been a tremendous effort on combining generative and discriminative approaches, or fusing the two, for classification [4, 8, 11, 21]. One possible combination uses the generative model for image representation and

the discriminative model for feature extraction using the similarity refinement analysis [12, 37].

3. Methodology

3.1. System diagram and overview

As illustrated in Fig. 1, the framework of the proposed image classification scheme is composed of an offline training mode and an online testing mode. The main operations of the proposed scheme include SPM partition, saliency map extraction, histogram mining and normalization, and SVM classification [22]. A densely computed descriptor is used as the local feature descriptor after saliency map calculation and extraction. Using the k-means approach, the local features are mapped into a vocabulary, and each image is represented by the occurrence statistics of the code words. Finally, the SVM classifier makes the classification decision. Several pre-processing steps such as gray image extraction, standard multi-scale partition, and normalization are performed on the input images. The popular SPM approach proposed by Lazebnik et al. [2] is applied to divide the input image into patches so as to encode structural information in the image representation. Fig. 2 illustrates the multi-scale partition technique used to generate the image spatial pyramid. As shown in Fig. 2, 1×1 (original image), 2×2 and 4×4 regions (a total of 21 regions) are generated by the SPM technique. In the proposed method, the aspect ratio of the input image is maintained by the SPM multi-scaling technique. The feature vector obtained from the multi-scale partition and the dense descriptor is usually of high dimension; therefore, popular dimension reduction algorithms such as principal component analysis (PCA) and linear discriminant analysis (LDA) are used to reduce the dimension of the feature vector without introducing significant loss. In our scheme, the kernel-based PCA and LDA model proposed by Yang et al. [38] is used to reduce the dimension of the feature vector.
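The 21-region partition described above can be sketched in a few lines; the following is a minimal NumPy illustration (the function name and level choices are ours, following the 1×1, 2×2, 4×4 scheme of Lazebnik et al. [2]):

```python
import numpy as np

def spm_regions(image, levels=(1, 2, 4)):
    """Split an image into a spatial pyramid of sub-regions.

    With levels (1, 2, 4) this yields 1 + 4 + 16 = 21 regions,
    matching the SPM partition used in the proposed scheme.
    """
    h, w = image.shape[:2]
    regions = []
    for n in levels:
        ys = np.linspace(0, h, n + 1, dtype=int)  # row boundaries
        xs = np.linspace(0, w, n + 1, dtype=int)  # column boundaries
        for i in range(n):
            for j in range(n):
                regions.append(image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]])
    return regions

# Example: a 64x48 grayscale "image" yields 21 regions.
img = np.arange(64 * 48, dtype=float).reshape(64, 48)
regions = spm_regions(img)
```

A histogram is then computed per region and the 21 histograms are concatenated, which is why dimension reduction (PCA/LDA) is applied afterwards.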

[Fig. 1 shows the pipeline: in offline training, images from the train image database undergo multi-scale partition, saliency map extraction, dense descriptor computation, image representation, and image score extraction, followed by discriminative learning and histogram mining, histogram normalization, and the SVM classifier; in online testing, images from the test image database pass through the same multi-scale partition, saliency map, dense descriptor, image representation, and image score stages before the decision rule.]

Fig. 1. Diagram of the proposed image classification scheme.

Fig. 2. Multi-scale image partition.

3.2. Image representation and histogram mining

Histogram mining is an effective approach to reduce the number of redundant features introduced by the multi-scale partitioning. As shown in Fig. 3, the original histogram is constructed using the edge information calculated by the Canny operator. Based on the discriminative score, histogram mining is then performed to generate the final histogram for classification. It is noted by Fernando et al. [32] that optimal results are difficult to achieve using the discriminativity and representativity criteria alone; however, hybrid relevant and non-redundant patterns can be included to generate better results using the pattern relevance criterion. Based on both the discriminativity and representativity scores, the histogram mining score for pattern m is computed as:

$$R(m) = d(m)\, v(m), \qquad (1)$$

where $d(m)$ $(0 \le d(m) \le 1)$ and $v(m)$ are the discriminativity and representativity scores, respectively. A high value of $R(m)$ means that pattern $m$ is very discriminative across images and suitable for classification.

Fig. 3. Illustration of the histogram mining process, where the green dot represents the central location and the red square represents the size of the pattern. From left to right: original image, edge information of the original image, computed descriptor based on the edge map information.

Assuming $p(c\,|\,m)$ is the conditional probability of image class $c$ given the specific pattern $m$, discriminativity based on the entropy approach [32] is defined as:

$$d(m) = 1 + \frac{\sum_{c} p(c\,|\,m) \log p(c\,|\,m)}{\log O}, \qquad (2)$$

where $\log O$ is a regularization term ensuring $0 \le d(m) \le 1$. Supposing $A_i$ is the $i$-th image, $N$ is the total number of images in the class, and $F(m\,|\,A_i)$ is the frequency of pattern $m$, $p(c\,|\,m)$ is computed as:

$$p(c\,|\,m) = \frac{\sum_{i=1}^{N} F(m\,|\,A_i)\, p(c\,|\,A_i)}{\sum_{i=1}^{N} F(m\,|\,A_i)}, \qquad (3)$$

where $F(m\,|\,A_i)$ equals 1 if pattern $m$ occurs in image $A_i$ and 0 otherwise. Similarly, $p(c\,|\,A_i)$ is 1 if the class label of $A_i$ is the same as $c$ and 0 otherwise.

Let $P(c\,|\,m')$ denote the distribution for class $c$, and let $p(c\,|\,m)$ be calculated from the frequency $F(m\,|\,c)$, which is essentially the distribution of pattern $m$. The representativity score is then defined as the best match over all classes:

$$v(m) = \max_{c}\left(\exp\left[-D_{KL}\big(P(c\,|\,m')\,\|\,p(c\,|\,m)\big)\right]\right), \qquad (4)$$

where $D_{KL}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler (KL) divergence between the two distributions.

3.3. Saliency map and image score

Visual saliency or point distribution is often used to obtain the generic saliency and emphasize different local regions such as contours, edges, and colours. Fig. 4 illustrates the image samples, saliency distribution, and saliency maps. It can be observed that meaningful and discriminative regions can be reliably obtained from the saliency map.

Fig. 4. Illustration of image samples and their saliency maps. From left to right: original images, saliency map distributions (red means high frequency) and binary saliency maps.

In this work, graph-based visual saliency (GBVS) [39], a bottom-up saliency model, is utilized to extract saliency maps from images [30]. The co-occurrence of a code word between each pair of visual words is obtained and incorporated into the feature vector (BoP histograms). For $n$ local descriptors $s = \{s_t\},\ t = 1, \ldots, n$, and each codeword $v$ generated in the original BoW method, the saliency-driven representation is computed as:

$$s = \sum_{t=1}^{n} \omega_t \,\| v - s_t \|^2, \qquad (5)$$

where the weighting factor $\omega_t$ at location $(x, y)$ is defined as:

$$\omega_t = \exp\!\left(-\frac{\| x - y \|^2}{2\sigma^2}\right), \qquad (6)$$

where $\sigma$ is the weighting parameter. Let $s_i^c$ denote the saliency of the $i$-th image in the $c$-th class and $h_i^c$ denote the BoP histogram of the $i$-th image in the $c$-th class with the appropriate normalization. The image is represented by the saliency feature concatenated with the feature histogram:

$$X = [\ldots, s_1^c h_1^c, \ldots, s_i^c h_i^c, \ldots, s_N^M h_N^M]. \qquad (7)$$

For image score weighting, the image centre is calculated by choosing the cell that minimizes the total distance to the other cells:

$$S_i = \arg\min_{S_j} \sum_{j=1}^{M} d(S_i, S_j), \qquad (8)$$

where $d$ is the distance metric, based on the histograms of the local image cells. The $\chi^2$ distance is adopted to measure the difference:

$$d(S_i, S_j) = \sum_{r} \frac{(s_{jr} - s_{ir})^2}{s_{jr} + s_{ir}}. \qquad (9)$$

It is noted that if a cell is far from the cell centre, it is highly probable that this cell does not belong to this class, and a lower weight should be assigned to it. Let $d$ denote the distance between the $i$-th image score and the score centre of each class; the image score is then defined as:

$$\psi_i = f(d) = \frac{\lambda^{\kappa}\, d^{\,s_i - 1}\, e^{-\lambda d}}{\Gamma(\kappa)}, \qquad d > 0, \qquad (10)$$

where $\lambda$ is the strength metric, $s_i$ is the saliency map, $\kappa$ is the boundary factor, and $\Gamma(\cdot)$ is the gamma function. The spatial context information is integrated into SVM learning using the normalized image score, computed as:

$$\bar{\psi}_i = \frac{\psi_i}{\sum_{i=1}^{N} \psi_i}. \qquad (11)$$
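The score pipeline of Eqs. (6), (9), (10) and (11) can be sketched as below; the gamma-density reading of Eq. (10) and the parameter defaults are our assumptions, not a confirmed implementation:

```python
import numpy as np
from math import gamma

def gaussian_weight(x, y, sigma=1.0):
    """Eq. (6): omega = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def chi2_distance(si, sj, eps=1e-12):
    """Eq. (9): chi-square distance between two cell histograms."""
    si = np.asarray(si, dtype=float)
    sj = np.asarray(sj, dtype=float)
    return float(np.sum((sj - si) ** 2 / (sj + si + eps)))

def image_score(d, lam=1.0, kappa=2.0):
    """Eq. (10), read as a gamma-density weighting of the distance d
    to the class score centre (lam and kappa are illustrative)."""
    return (lam ** kappa) * d ** (kappa - 1) * np.exp(-lam * d) / gamma(kappa)

def normalize_scores(psi):
    """Eq. (11): image scores normalized to sum to one."""
    psi = np.asarray(psi, dtype=float)
    return psi / psi.sum()
```

Under this reading, scores decay exponentially for large distances, consistent with the intuition that cells far from the class centre should receive lower weight; lam and kappa control the shape of that decay.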

3.4. Kernel and feature map

As discussed in [17], classification results were improved using an explicit feature map that approximates a non-linear kernel as a linear one. The main objective of this approach is to determine a non-linear mapping of the feature vectors so that a linear algorithm can be applied in the feature space. This forms the basic idea of the popular

kernel trick method, which can be applied to any approach based on a distance metric (i.e., nearest neighbours), such as the intersection kernel, chi-square ($\chi^2$) kernel, and Jensen-Shannon (JS) kernel [22]. Generally, different kernels have different effects on classification performance [32]. Table 1 shows the kernels and the corresponding distance metrics used to calculate the distance between histograms x and z.

Table 1. Kernels and distance metrics.

Kernel | Definition | Distance metric
Intersection | $K_{\min}(\mathbf{x},\mathbf{z}) = \sum_{i=1}^{n} \min(x_i, z_i)$ | $d^2(\mathbf{x},\mathbf{z}) = \|\mathbf{x} - \mathbf{z}\|^2$
$\chi^2$ | $K_{\chi^2}(\mathbf{x},\mathbf{z}) = \sum_{i=1}^{n} \dfrac{2 x_i z_i}{x_i + z_i}$ | $d^2(\mathbf{x},\mathbf{z}) = \chi^2(\mathbf{x},\mathbf{z})$
JS | $K_{JS}(\mathbf{x},\mathbf{z}) = \sum_{i=1}^{n} \left( \dfrac{x_i}{2}\log_2\dfrac{x_i+z_i}{x_i} + \dfrac{z_i}{2}\log_2\dfrac{x_i+z_i}{z_i} \right)$ | $d^2(\mathbf{x},\mathbf{z}) = D_{KL}(\mathbf{x}\,\|\,(\mathbf{x}+\mathbf{z})/2) + D_{KL}(\mathbf{z}\,\|\,(\mathbf{x}+\mathbf{z})/2)$

As reported in [17], an algorithm using an inner product (i.e., linear SVM) can operate in a non-linear way by replacing the inner product with a suitable kernel or feature map. Motivated by the kernel trick approach, the kernel calculation can be approximated and accelerated by a feature map:

$$K(\mathbf{x}, \mathbf{z}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle, \qquad (12)$$

where $\Phi(\mathbf{x})$ and $\Phi(\mathbf{z})$ are the feature maps of the feature vectors and $\langle \cdot, \cdot \rangle$ is the inner product. For example, the feature map for the Hellinger (or Bhattacharyya) kernel is obtained by taking the element-wise square root: $\Phi(\mathbf{x}) = \sqrt{\mathbf{x}}$. It is shown in [17] that the distance of the Hellinger kernel equals the Euclidean distance between feature maps. Consequently, this feature map is equivalent to the non-linear Hellinger kernel, and it was demonstrated that the Hellinger kernel produces superior results to the linear kernel [17].
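The chi-square kernel from Table 1 and the Hellinger feature map can be checked numerically; a small sketch (the helper names are ours):

```python
import numpy as np

def chi2_kernel(x, z, eps=1e-12):
    """Table 1: K_chi2(x, z) = sum_i 2 x_i z_i / (x_i + z_i)."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return float(np.sum(2.0 * x * z / (x + z + eps)))

def hellinger_map(x):
    """Explicit feature map for the Hellinger kernel: element-wise sqrt."""
    return np.sqrt(np.asarray(x, dtype=float))

# l1-normalized histograms
x = np.array([0.5, 0.3, 0.2])
z = np.array([0.2, 0.3, 0.5])

# The mapped inner product reproduces the Hellinger kernel exactly,
# so a linear SVM on hellinger_map(x) behaves like a Hellinger-kernel SVM.
k_exact = float(np.sum(np.sqrt(x * z)))
k_mapped = float(hellinger_map(x) @ hellinger_map(z))
```

For an l1-normalized histogram, K_chi2(x, x) = sum_i x_i = 1, which gives a quick sanity check on the kernel implementation.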

3.5. Histogram normalization

In most datasets, the testing and training images are of different widths and heights, which leads to variations in the images and their histograms. The multi-scale technique splits images into regions of different sizes and causes further variations. Since the histograms are the inputs of the SVM classifier, a normalized histogram is likely to lead to improved classification results; it is therefore desirable to perform normalization to achieve consistency. The $l_p$ norm is a useful tool to address the variation problem. The $l_p$ norm for $p \ge 1$ is defined as:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}, \qquad (13)$$

where $p = 1$ gives the $l_1$ norm and $p = 2$ gives the $l_2$ norm (the Euclidean norm). As discussed in [17], normalized histograms tend to favour relatively larger scores in larger regions. When the histograms are $l_2$ normalized, the maximum region similarity in the SVM classifier is attained when two regions share the same appearance. Meanwhile, when a histogram h is $l_p$ normalized, the feature mapping by the kernel method is a constant. A detailed discussion of the relationship between classification accuracy and the efficiency of histogram normalization can be found in [24]. As shown in [24], histogram normalization is especially useful for the kernel-based SVM classifier. However, the statistical information of the histogram distribution is often ignored. As observed from the BoF-PDF algorithm in [26], a promising classification score is achieved if the image is well represented by a finitely distributed histogram after normalization.

It should be noted that the normalization process does not retain the original histogram distribution, which depends on the norm used. Differing from previous works, novel histogram normalization techniques are proposed here to represent the images. Figs. 5 and 6 illustrate the effects on the histogram PDF after normalization based on various norm methods for the events [41] and soccer [40] datasets, respectively. Plots (a)-(f) in Figs. 5-6 show the histogram PDF of the original (none), square root norm (sqrt), l1 norm (L1), l1 norm of intra- and inter-class (L1N), l2 norm (L2) and l2 norm of intra- and inter-class (L2N), respectively, where the L1N and L2N methods are the proposed ones. The normalized histogram is distributed in finite dimensions, exhibiting characteristics similar to a Gaussian mixture model, and is generally much better represented than the non-normalized one. The L1N and L2N methods take advantage of the variations between classes and perform normalization accordingly. In the L1 and L2 methods, only the l1 or l2 norm is applied to the histogram feature within each class (intra-class), whereas the L1N and L2N methods extend them by also applying the l1 or l2 norm across classes (inter-class). To further illustrate this, the empirical (histogram) distributions fitted by modelled parametric distributions are shown in each figure. The blue, red, green and purple curves represent the fitted inverse Gaussian, generalized extreme value, normal, and log normal distributions, respectively. From Figs. 5-6, we can see that both the inverse Gaussian and generalized extreme value distributions represent the events and soccer datasets better than the normal and log normal distributions. It is noted that the inter- and intra-class norm methods reduce the inter- and intra-class variations. It is also observed that the histogram feature is better represented by the L1N method than the L1 method, and similar findings hold for the L2N and L2 methods. Furthermore, the L2N method achieves slightly better results than the L1N method. Consistent with the observations reported in [17, 23], normalization boosts classification performance, and better performance is achieved if an appropriate norm method is applied.

Fig. 5. Histogram PDF and parametric distribution of events dataset vs. different norm methods; (a) non-normalized; (b) sqrt norm; (c) l1 norm; (d) l1 norm of inter and intra class; (e) l2 norm; (f) l2 norm of inter and intra class.

Fig. 6. Histogram PDF and parametric distribution of soccer dataset vs. different norm methods; (a) un-normalized; (b) sqrt norm; (c) l1 norm; (d) l1 norm of inter and intra class; (e) l2 norm; (f) l2 norm of inter and intra class.
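The intra-/inter-class normalization can be sketched as follows; since the exact formulation is not spelled out above, the inter-class step here (dividing each class by the norm of its mean histogram) is our assumption, one plausible reading of the L2N method:

```python
import numpy as np

def l2_rows(H, eps=1e-12):
    """l2-normalize each histogram (one row per image): the L2 method."""
    H = np.asarray(H, dtype=float)
    return H / (np.linalg.norm(H, axis=1, keepdims=True) + eps)

def l2n_normalize(H, labels, eps=1e-12):
    """Sketch of the L2N method: l2-normalize each histogram (intra-class),
    then rescale each class by the norm of its mean histogram (inter-class).
    The inter-class step is our reading, not a confirmed formulation."""
    H = l2_rows(H)
    labels = np.asarray(labels)
    out = np.empty_like(H)
    for c in np.unique(labels):
        idx = labels == c
        out[idx] = H[idx] / (np.linalg.norm(H[idx].mean(axis=0)) + eps)
    return out

H = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
Hn = l2_rows(H)       # L2: unit-norm rows
Hn2 = l2n_normalize(H, labels)  # L2N: additionally scaled per class
```

Swapping `l2_rows` for an l1 version yields the corresponding L1/L1N variants.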

3.6. Learning and classification

For the classification task, the PEGASOS SVM approach in [17] is adopted. In the SVM classifier, a hierarchical scoring scheme is applied to classify the images in the database, and the image scores are integrated into training as a conditional parameter. The SVM scoring scheme using a hyperplane H is defined as:

H : w x_i^T + b = 0,  (14)

where w = (w_1, w_2, ..., w_N) is an adaptable and adjustable weighting vector, b ∈ R is a bias, and T denotes the transpose operation. Introducing non-negative slack variables ξ_1, ..., ξ_N (i.e., ξ_i ≥ 0 for each i), weighted by the image scores s_i, to obtain a generalized derivation of the decision function, the above problem becomes:

d_i (w x_i^T + b) ≥ 1 − s_i ξ_i,  i = 1, 2, ..., N,  (15)

where s_i denotes the image score of the i-th image. The image scores thus enter the SVM classifier through the slack variables, which permit margin failures, and therefore affect the tradeoff between the number of misclassified input vectors and margin maximization. To obtain the optimal decision function, w is minimized as:

min: (1/2) w w^T + C Σ_{i=1}^{N} s_i ξ_i
s.t. d_i (w x_i^T + b) ≥ 1 − s_i ξ_i,  ξ_i ≥ 0,  i = 1, 2, ..., N,  (16)

where C is the regularization parameter.
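As a rough sketch of how image scores can weight the slack penalty during training, the snippet below uses scikit-learn's `LinearSVC` with per-sample weights as a stand-in for the PEGASOS solver of [17]; the data and scores are synthetic, and `sample_weight` scales each sample's contribution to the hinge loss, mirroring a score-weighted slack term.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two synthetic classes standing in for normalized histogram features.
X = np.vstack([rng.normal(0.0, 1.0, (50, 8)), rng.normal(2.0, 1.0, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
# Hypothetical saliency-derived image scores in (0.5, 1.0]; a higher score
# makes a margin violation on that image more expensive in training.
scores = rng.uniform(0.5, 1.0, size=100)

clf = LinearSVC(C=1.0)
clf.fit(X, y, sample_weight=scores)
train_acc = clf.score(X, y)
```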

4. Experimental results

4.1. Experimental datasets

For performance comparison, several popular and comprehensive datasets are used in our experiments: the natural scene image dataset [2] (scene 15, 15 classes), the complex event and activity UIUC-Sports dataset (events) [41] (8 classes), the soccer team dataset (soccer) [40] (7 classes, one per team), the people playing musical instruments (PPMI) dataset [11, 33, 37] (11 classes), and the Caltech 101 dataset (102 classes). The classification result is evaluated as the average classification accuracy over all classes in each dataset.
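The evaluation metric, average accuracy over classes, can be computed as follows. This is the standard formulation of the metric, not code from the paper:

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies: every class contributes equally,
    regardless of how many test images it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))
```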

4.2. Effect of different number of visual words

More visual words often lead to better classification accuracy due to the higher discriminative power of the words, but the improvement diminishes beyond a certain vocabulary length. For our comparison, the proposed method is evaluated with various vocabulary sizes (w = 128, 256, 512, 1024). Our comparative analysis considers popular algorithms such as BoF-PDF [26], Fisher kernel [35], super-vector coding [31], and BoF. As illustrated in Fig. 7, the proposed method produces the highest classification accuracy using the shortest code word (i.e., w = 128) on the scene 15 dataset as compared to the other methods. It is also noted that the classification accuracy of the proposed method increases with the code word length. The recently developed BoF-PDF method exhibits similar performance to the proposed method at w = 128 and outperforms the other existing classification algorithms for all code word lengths. Attributed to the histogram mining and feature map, the proposed method possesses higher discriminative power to characterize local descriptors and classify the image compared to the BoF, BoW, and BoF-PDF methods.

The events dataset is available at: http://vision.stanford.edu/lijiali/event_dataset/
The soccer dataset is available at: http://lear.inrialpes.fr/people/vandeweijer/data
The PPMI dataset is available at: http://ai.stanford.edu/bangpeng/ppmi.html
The Caltech 101 dataset is available at: http://www.vision.caltech.edu/feifeili/Datasets.htm
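To make the role of the vocabulary size w concrete, a minimal BoF quantization step is sketched below with scikit-learn's KMeans on synthetic descriptors. This illustrates the standard BoF pipeline, not the paper's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
train_desc = rng.normal(size=(500, 16))   # pooled local descriptors (training)
image_desc = rng.normal(size=(60, 16))    # descriptors of one test image

def bof_histogram(desc, vocab_size, train_desc):
    """Learn a visual vocabulary of size w and return the l1-normalized
    word histogram of one image; a larger w discretizes descriptor space
    more finely, raising discriminative power at extra computational cost."""
    km = KMeans(n_clusters=vocab_size, n_init=3, random_state=0).fit(train_desc)
    hist = np.bincount(km.predict(desc), minlength=vocab_size).astype(float)
    return hist / hist.sum()

h = bof_histogram(image_desc, 128, train_desc)
```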

Fig. 7. Classification accuracy vs. different visual words (scene 15 dataset).

4.3. Effect of different descriptors

In the proposed method, several dense descriptors (e.g., Phow, DAISY, and DSIFT) are applied to generate features for classification. To evaluate the classification performance of these descriptors, the scene 15, events, soccer, and PPMI datasets are utilized in our comparative analysis. Fig. 8 shows the classification accuracy of the three descriptors. Generally, it is easier to discriminate patterns and features in a small sample set than in a large one. The best classification accuracy is observed on the soccer dataset, and it should be noted that the total number of samples in this dataset is relatively small. We found that the DAISY descriptor performs best among the three. The DAISY descriptor computes features at each pixel, whereas the DSIFT and Phow descriptors compute features on a regular dense grid. This difference causes some important features to be missed by the DSIFT and Phow descriptors, which leads to inferior classification performance compared to the DAISY descriptor.
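The sampling-density difference between per-pixel and grid-based descriptors can be illustrated with a toy orientation-histogram descriptor in NumPy (hypothetical sizes and stride): per-pixel sampling evaluates every valid center, while a stride-8 grid keeps only a small fraction of them.

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Gradient-orientation histogram of one patch; a crude stand-in
    for a DAISY/DSIFT cell descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)

img = (np.add.outer(np.arange(32), np.arange(32)) % 7).astype(float)  # toy texture
r = 4  # patch radius
# Per-pixel centers (DAISY-style) vs. a stride-8 regular grid (DSIFT/Phow-style).
per_pixel = [(y, x) for y in range(r, 32 - r) for x in range(r, 32 - r)]
grid = [(y, x) for y in range(r, 32 - r, 8) for x in range(r, 32 - r, 8)]
descs = [orientation_histogram(img[y - r:y + r, x - r:x + r]) for y, x in grid]
```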

Fig. 8. Classification accuracy vs. different descriptors.

4.4. Experiments on various datasets

4.4.1 Experiment on scene 15 dataset

The natural scene 15 dataset [2] is composed of 4485 gray-scale images (i.e., natural scenes and man-made environments) of different sizes. One image from each class of the scene 15 dataset is shown in Fig. 9. The experimental setup for scene 15 is the same as that in Lazebnik et al. [2], and all images are normalized to no more than 240×240 pixels. A total of 100 images per class are used for training, and testing is performed with the remaining images.

Fig. 9. One sample image from each class in the scene15 dataset.

Table 2 summarizes the classification accuracy of various methods on the scene 15 dataset. It is observed that the proposed method has the highest accuracy among the selected algorithms. As expected, the conventional SPM method in [2] has the lowest performance. The expanded BoW [5], salient coding [20], and sparse coding [42] methods are extended from the SPM model and outperform it by using word constraints and occurrence. Among these algorithms, the super-vector coding method in [31] used the longest code word (w = 1024) to achieve reasonably high accuracy. While it is reasonable to use a longer code word to obtain higher accuracy on a relatively large dataset (scene 15 has a total of 4485 images), the remaining methods achieved good performance using short code words, which leads to faster computation. BoF-PDF [26] took advantage of aggregated difference vectors rather than visual code words to achieve good classification accuracy. Using histogram mining, both the frequent itemset mining [32] and the proposed methods achieved high classification accuracy with short code words. The application of saliency maps as side information in the proposed method also contributes to the higher classification accuracy.

Table 2 Performance comparison on the scene 15 dataset.

Method                                 Accuracy (%)
SPM [2]                                81.40±0.5
Spatial pyramid co-occurrence [14]     82.51±0.43
Adapted Gaussian model [36]            85.40±0.36
Salient coding [20]                    82.55±0.41
Expanded BoW [5]                       83.76±0.59
Sparse coding + BoF [42]               84.3±0.5
Fisher kernel [35]                     82.94±0.78
Super-vector coding (w=1024) [31]      84.79±0.76
BoF-PDF (w=256) [26]                   85.63±0.67
Frequent itemset mining [32]           86.2±0.73
Proposed (w=256)                       86.78±0.58

4.4.2. Experiment on UIUC-Sports dataset

The UIUC-Sports dataset (events) [41] comprises 8 classes of complex sports events collected from the internet, containing 1574 colour images of different sizes. This dataset is challenging to classify, as its images contain highly cluttered and diverse backgrounds, and objects of different sizes, instances, and poses. One sample image from each of the bocce, croquet, polo, rowing, snowboarding, badminton, sailing, and rock climbing classes is shown in Fig. 10. The dataset is divided into training and testing sets randomly. Table 3 shows the classification accuracy of various methods on the events dataset. Fisher kernel [35] and super-vector coding [31] produced relatively higher accuracy than the traditional object filtering [43], integrative model [41], expanded BoW [5], and hierarchical matching [23] methods, but were generally outperformed by BoF-PDF [26] and the proposed method. It can be seen that the proposed method achieves the highest accuracy on the events dataset, and outperforms the super-vector coding method [31] even with a short code word. Since the proposed method employs a short code word, efficient computation is also possible.

Fig. 10. One sample image from each class of the events dataset.

Table 3 Performance comparison on the events dataset.

Method                                 Accuracy (%)
Object filtering [43]                  77.88±1.2
Integrative model [41]                 73.40±1.13
Adapted Gaussian model [36]            84.40±0.89
Expanded BoW [5]                       84.56±1.5
Hierarchical matching [23]             85.70±1.3
Fisher kernel (w=256) [35]             88.61±1.16
Super-vector coding (w=1024) [31]      90.83±1.06
BoF-PDF (w=256) [26]                   90.42±1.03
Proposed (w=256)                       91.74±0.68

4.4.3. Experiment on Caltech 101 dataset

To further validate the classification performance of the proposed method, one of the most commonly used benchmark datasets, the Caltech 101 database [1], is also used. This publicly available database consists of 9146 images in a total of 101 different classes. The images are resized to be no larger than 480×480 pixels while maintaining their aspect ratio.

In our experiments, the same setting as in [17] is used for both the training and testing images from this dataset. 15 images per class are randomly selected as testing images, and various sets of 15, 20, 25, and 30 images per class are randomly selected for training. The experiments classify the 102 classes (including the background) using a vocabulary size of 600 for the benchmark test. We compare the proposed method with state-of-the-art image classification algorithms [2, 13, 17, 25, 44-46], and the average image classification accuracy on the Caltech 101 dataset is shown in Fig. 11. The proposed method achieves the highest accuracies of 74.83% and 85.21% with the DAISY descriptor for 15 and 30 training images, respectively. Since the edge-based dense descriptor extracts highly discriminative features, the proposed method performs significantly better on this dataset. The discrimination between different classes is further enhanced by the kernel trick in the feature mapping, the histogram normalization, and the image score method.

Fig. 11. Algorithm comparison based on Caltech 101.

4.4.4. Experiment on PPMI dataset

Next, we perform our comparison on the PPMI dataset, which contains images of people playing different musical instruments. One sample image from each class of the PPMI dataset is shown in Fig. 12. 100 images of uniform size in each class are used for training and testing. Compared with the scene understanding and classification on the scene 15 dataset discussed above, more discriminative visual words are needed to classify the PPMI dataset. Also, the musical instruments in the images provide more descriptive information than the image backgrounds. Based on these observations, it is expected that densely computed descriptors perform well on the PPMI dataset.

Fig. 12. Example images from the PPMI dataset.

Table 4 presents the classification results for the 12 multi-class classification problems on the PPMI dataset. The SPM and standard BoW methods achieve poor results, but better results were achieved using SPM extensions such as the locality-constrained linear coding (LLC) proposed by Wang et al. in [27]. Compared to BoW and SPM, better performance was obtained by Yao's grouplet-based methods [33]. The grouplet is a new and effective attribute for discriminating highly detailed objects, and by capitalizing on grouplets, relatively good classification results were achieved in combination with SVM. On this note, the discriminative power of the grouplet or fusion methods is still not sufficient to produce high accuracy on the PPMI dataset. As reported in [11], the random forest algorithm improved on the grouplet+SVM algorithm in [33] by 9.9%. However, the random forest algorithm is too computationally intensive due to the large number of iterations and multi-scale images. The proposed method outperforms the grouplet+model and grouplet+SVM methods as well as the latest random forest+grouplet+SVM algorithm. Since the grouplet methods only use patches at one scale, methods that employ multi-scale patches, such as the proposed method, the saliency map + SVM method [11], and the multi-scale methods, are able to outperform them.

Table 4 Performance comparison on the PPMI dataset.

Method                                 Accuracy (%)
BoW [1]                                78.0±1.35
SPM [2]                                88.2±0.67
Grouplet+Model [33]                    80.12±1.12
Grouplet+SVM [33]                      85.1±0.83
Saliency map+SVM [11]                  91.2±0.75
LLC [27]                               89.2±0.92
Random forest+grouplet+SVM [37]        92.1±0.77
Proposed (w=256)                       93.06±0.62

4.4.5. Experiment on soccer dataset

The soccer dataset [40] contains 40 low-quality internet images in each class, with a total of 7 soccer classes (teams). One image from each team is shown in Fig. 13. In our experiment, the same setting as in [40] was adopted, and both the training and testing images are obtained from the same team. The soccer dataset contains a very small number of images, which simplifies the classification task and leads to very high classification results. The classification accuracy of the proposed method on this dataset is 98.7%, whereas the best result in [40] using the colour and shape descriptor was only 89%.

Fig. 13. Example images from the soccer dataset.

4.5. Effect of different kernels

The performance improvement of two intersection kernels (kl1 and kint), the chi-square kernel (kchi2), and the Jensen-Shannon kernel (kjs) over no kernel (None) is evaluated on the scene 15 dataset. Fig. 14 summarizes the classification results of the various kernels using different descriptors and code words. The kernel-based methods outperform the method without any kernel, except when using the kjs kernel with the Phow descriptor at w = 256. The main reason is that the smoothing in Phow has a negative effect on accuracy with the kjs kernel and a small number of visual words. It can be seen that the intersection kernels (i.e., kl1 and kint) outperform the non-intersection kernels (i.e., kchi2 and kjs). As kl1 and kint are of the same kernel type, they exhibit very similar performance. It is also observed that most kernels achieve better results with w = 512 than with w = 256. Hence, it is concluded that higher discriminative power and better feature representation are obtained by using more visual words with these kernels. The DSIFT and DAISY descriptors produce similar results under the different kernels, and the Phow descriptor generally produces the worst performance. This is attributed to the Gaussian low-pass smoothing applied in the Phow descriptor, which makes the features less distinctive and hence reduces the discriminative power of the descriptor.

4.6. Effect of different norm methods

Normalization is an effective approach to improve classification accuracy, especially when there are large variations within each class. The original histogram features (non-normalized histograms) generated by the BoP model exhibit large variations, which reduce their discriminative power. Moreover, patches of various sizes are generated from the same image by the multi-scale partition of the SPM model. Using histogram normalization or stretching, the range of the code words is normalized to a limited, homogeneous scale, and hence classification effectiveness and performance are improved.

Fig. 14. Performance comparison of different kernels with different code word lengths on the scene 15 dataset; (a) w = 256; (b) w = 512.
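For reference, the three non-trivial kernels compared in Fig. 14 admit simple closed forms on a pair of histograms. The NumPy transcription below uses the standard definitions of these homogeneous kernels (as in, e.g., the formulations of [17]), not the paper's code:

```python
import numpy as np

def k_int(h1, h2):
    """Histogram intersection kernel: sum of element-wise minima."""
    return np.minimum(h1, h2).sum()

def k_chi2(h1, h2):
    """Chi-square kernel: sum of 2*h1*h2 / (h1 + h2) over non-zero bins."""
    denom = h1 + h2
    m = denom > 0
    return (2 * h1[m] * h2[m] / denom[m]).sum()

def k_js(h1, h2):
    """Jensen-Shannon kernel."""
    def term(a, b):
        m = a > 0
        return (a[m] / 2 * np.log2((a[m] + b[m]) / a[m])).sum()
    return term(h1, h2) + term(h2, h1)
```

For any l1-normalized histogram h, all three kernels satisfy k(h, h) = 1, which is why they can be compared on equal footing in Fig. 14.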

Experiments with the four kernels are performed to validate the effects of various linear and non-linear normalization methods. The classification performance of the non-linear normalization method (i.e., sqrt norm) and the linear normalization methods L1, L2, and the proposed L1N and L2N norms is shown in Fig. 15. High classification accuracies are obtained on the events, PPMI, and soccer datasets with the different normalization methods. It can be seen that normalized histograms substantially improve the classification results over non-normalized histograms. In general, the proposed L2N and L1N normalization methods outperform the traditional non-linear (i.e., sqrt) and linear L1 and L2 normalization methods. Furthermore, the L2 and L2N methods generally produce better results than the L1 and L1N methods, respectively. Since a linear SVM classifier is adopted for classification, linear normalization methods produce relatively better classification results than non-linear ones. As discussed in [17, 22, 24], norm methods have a positive impact on classification accuracy under different homogeneous kernels, which coincides with the observations in our experiment. Besides, the intersection kernels (i.e., kl1 and kint) perform better than the non-intersection kernels (i.e., kchi2 and kjs), as also demonstrated in Fig. 15.

Fig. 15. Classification accuracy vs. different norm methods on (a) events dataset; (b) soccer dataset; (c) PPMI dataset.

4.7. Computational cost

The experiments for computational cost evaluation are based on Matlab 2012a running on a computer with a 64-bit Windows Server 2008 operating system, a 4-core CPU at 2.30 GHz (AMD Opteron Processor 6134), and 32 GB RAM. Note that the proposed algorithm is implemented mainly in Matlab, and shorter classification times are expected if our algorithm is implemented and optimized in a compiled language such as C++ or Java. In our experiment, the DAISY feature extraction time for a 480×480 testing image is 1.0909 seconds, and the saliency extraction time for a 480×480 testing image is 0.8435 seconds. Table 5 shows the computational time of the proposed method on the five datasets discussed in the earlier sections. The kernel and histogram normalization used for the various datasets in Table 5 are the kchi2 kernel and the L2N method, respectively. It can be observed that only a short computational time is required for the classification operation, since the saliency-based method significantly reduces the computational cost. Although each dataset has a different total number of images, almost negligible computational cost is required for kernel mapping, histogram normalization, and testing in the proposed saliency-driven algorithm. These observations indicate that the proposed saliency-driven algorithm is efficient and effective for image classification.

Table 5 Results of average computational cost.

Dataset    Stage                      Time (seconds)
Caltech    Kernel mapping             0.1331
           Histogram normalization    0.1892
           Testing                    1.2202
Scene 15   Kernel mapping             0.0821
           Histogram normalization    0.0914
           Testing                    0.692
Event      Kernel mapping             0.0356
           Histogram normalization    0.0458
           Testing                    0.121
Soccer     Kernel mapping             0.0052
           Histogram normalization    0.0015
           Testing                    0.077
PPMI       Kernel mapping             0.0584
           Histogram normalization    0.0648
           Testing                    0.213

4.8. Results and discussions

Based on the classification results obtained on the popular datasets, the proposed method demonstrates improved performance over existing methods such as SPM, SPM extensions, BoW extensions, BoF, BoF-PDF, super-vector coding, and the Fisher kernel. The following summarize our experimental findings:

1) The multi-scale spatial partition technique is very useful for classification, but it leads to numerous image and class variations. Therefore, histogram normalization is usually applied before classification, which leads to a substantial improvement in classification accuracy. The normalization performance is further improved by exploiting intra- and inter-class variations. This is consistent with our observation that the L2 and L2N normalization methods are more effective than the L1 and L1N normalization methods in our experiments.

2) The code word length is of great significance to the classification accuracy. A longer word length usually produces higher accuracy than a shorter one, but it also increases the computational time of the classification operation.

3) By exploiting spatial information and saliency, the BoP method takes advantage of coarse spatial information to generate the saliency map. The BoP method can also capture and aggregate the differences between feature vectors, and hence achieves much better performance than the BoF and BoW methods.

4) In the image representation, the histogram mining technique and saliency map produce the highest discriminative power among classes, and hence the classification performance is boosted by capturing the most important and relevant information for classification.

5. Conclusions

In this paper, a novel saliency-driven image classification method based on histogram normalization and mining is proposed. The BoP model and image score are obtained from the saliency map, which significantly boosts the classification performance. The image score from saliency is used to optimize the performance of the SVM classifier. Histogram mining and a discriminative learning technique are also utilized to enhance the image representation and classification accuracy. Furthermore, novel histogram normalization methods are applied to enhance classification by reducing both inter- and intra-class variation. Since consistency and a finite distribution of the histogram features are achieved by normalization, a substantial increase in classification accuracy is obtained. Experimental results on five public datasets showed that the proposed method outperforms existing methods in terms of classification accuracy. Future directions for improving this work include the integration of recently developed techniques such as low rank, low variance, deep learning, random forests, scalable vocabulary trees, sparse coding, and feature selection.

References

[1] Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 524-531 (2005)
[2] Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2169-2178 (2006)
[3] Li, T., Mei, T., Kweon, I.-S., Hua, X.-S.: Contextual Bag-of-Words for Visual Categorization. IEEE Transactions on Circuits and Systems for Video Technology. 21 (4), 381-392 (2011)
[4] Zhang, S., Tian, Q., Hua, G., Huang, Q., Gao, W.: Generating descriptive visual words and visual phrases for large-scale image applications. IEEE Transactions on Image Processing. 20 (9), 2664-2677 (2011)
[5] Liu, T., Liu, J., Liu, Q., Lu, H.: Expanded bag of words representation for object classification. in: Proceedings of IEEE International Conference on Image Processing. pp. 297-300 (2009)
[6] van Gemert, J., Veenman, C. J., Smeulders, A. W. M., Geusebroek, J.-M.: Visual Word Ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence. 32 (7), 1271-1283 (2010)
[7] Bolovinou, A., Pratikakis, I., Perantonis, S.: Bag of spatio-visual words for context inference in scene classification. Pattern Recognition. 46 (3), 1039-1053 (2013)
[8] M. Elfiky, N., Shahbaz Khan, F., van de Weijer, J., Gonzàlez, J.: Discriminative compact pyramids for object and scene recognition. Pattern Recognition. 45 (4), 1627-1636 (2012)
[9] Meng, X., Wang, Z., Wu, L.: Building global image features for scene recognition. Pattern Recognition. 45 (1), 373-380 (2012)
[10] Zhou, L., Zhou, Z., Hu, D.: Scene classification using a multi-resolution bag-of-features model. Pattern Recognition. 46 (1), 424-433 (2013)
[11] Sharma, G., Jurie, F., Schmid, C.: Discriminative spatial saliency for image classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3506-3513 (2012)
[12] Liang, Z., Chi, Z., Fu, H., Feng, D.: Salient object detection using content-sensitive hypergraph representation and partitioning. Pattern Recognition. 45 (11), 3886-3901 (2012)
[13] Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1794-1801 (2009)
[14] Yang, Y., Newsam, S.: Spatial pyramid co-occurrence for image classification. in: Proceedings of IEEE International Conference on Computer Vision. pp. 1465-1472 (2011)
[15] Zhang, C., Wang, S., Huang, Q., Liu, J., Liang, C., Tian, Q.: Image classification using spatial pyramid robust sparse coding. Pattern Recognition Letters. 34 (9), 1046-1052 (2013)
[16] Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence. 27 (10), 1615-1630 (2005)
[17] Vedaldi, A., Zisserman, A.: Efficient Additive Kernels via Explicit Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence. 34 (3), 480-492 (2012)
[18] Tola, E., Lepetit, V., Fua, P.: DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence. 32 (5), 815-830 (2010)
[19] Lei, B., Thing, V. L. L., Chen, Y., Lim, W.-Y.: Logo Classification with Edge-based DAISY Descriptor. in: Proceedings of IEEE International Symposium on Multimedia. pp. 222-228 (2012)
[20] Huang, Y., Huang, K., Yu, Y., Tan, T.: Salient coding for image classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1753-1760 (2011)
[21] Zhang, J., Marszałek, M., Lazebnik, S., Schmid, C.: Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision. 73 (2), 213-238 (2007)
[22] Maji, S., Berg, A. C., Malik, J.: Efficient Classification for Additive Kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (1), 66-77 (2013)
[23] Bo, L., Ren, X., Fox, D.: Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms. in: Proceedings of NIPS. pp. 2115-2123 (2011)
[24] Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. in: Proceedings of International Conference on Computer Vision. pp. 606-613 (2009)
[25] Gehler, P., Nowozin, S.: On feature combination for multiclass object classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 221-228 (2009)
[26] Kobayashi, T.: BoF meets HOG: Feature Extraction based on Histograms of Oriented p.d.f Gradients for Image Classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 747-754 (2013)
[27] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3360-3367 (2010)
[28] Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. in: Proceedings of IEEE International Conference on Computer Vision. pp. 2486-2493 (2011)
[29] Lin, L., Luo, P., Chen, X., Zeng, K.: Representing and recognizing objects with massive local image patches. Pattern Recognition. 45 (1), 231-240 (2012)
[30] Hou, X., Harel, J., Koch, C.: Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence. 34 (1), 194-201 (2012)
[31] Zhou, X., Yu, K., Zhang, T., Huang, T.: Image Classification Using Super-Vector Coding of Local Image Descriptors. in: Proceedings of European Conference on Computer Vision. pp. 141-154 (2010)
[32] Fernando, B., Fromont, E., Tuytelaars, T.: Effective use of frequent itemset mining for image classification. in: Proceedings of the 12th European Conference on Computer Vision (2012)
[33] Yao, B., Fei-Fei, L.: Grouplet: A structured image representation for recognizing human and object interactions. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 9-16 (2010)
[34] Quattoni, A., Torralba, A.: Recognizing indoor scenes. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 413-420 (2009)
[35] Perronnin, F., Dance, C.: Fisher Kernels on Visual Vocabularies for Image Categorization. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1-8 (2007)
[36] Dixit, M., Rasiwasia, N., Vasconcelos, N.: Adapted Gaussian models for image classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 937-943 (2011)
[37] Yao, B., Khosla, A., Fei-Fei, L.: Combining randomization and discrimination for fine-grained image categorization. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1577-1584 (2011)
[38] Yang, J., Jin, Z., Yang, J. Y., Zhang, D., Frangi, A. F.: Essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognition. 37 (10), 2097-2100 (2004)
[39] Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. in: Proceedings of NIPS. pp. 545-552 (2007)
[40] Weijer, J. v. d., Schmid, C.: Applying Color Names to Image Description. in: Proceedings of IEEE International Conference on Image Processing. pp. III-493-III-496 (2007)
[41] Li, L.-J., Li, F.-F.: What, where and who? Classifying events by scene and object recognition. in: Proceedings of IEEE International Conference on Computer Vision. pp. 1-8 (2007)
[42] Boureau, Y. L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2559-2566 (2010)
[43] Li, L.-J., Su, H., Lim, Y., Fei-Fei, L.: Objects as Attributes for Scene Classification. in: Trends and Topics in Computer Vision. vol. 6553, pp. 57-69 (2012)
[44] Zhang, H., Berg, A. C., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2126-2136 (2006)
[45] Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. California Institute of Technology (2007)
[46] Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1-8 (2008)

Highlights:
- A new saliency-driven bag-of-phrase approach for image classification is proposed.
- Edge-based dense descriptor is applied.
- Histogram mining and discriminative learning are investigated.
- Image score is adopted as latent information to optimize the linear classifier.
- Novel inter- and intra-class histogram normalization methods are explored.

Baiying Lei received her M. Eng degree in electronics science and technology from Zhejiang University, China in 2007, and Ph.D. degree from Nanyang Technological University (NTU), Singapore in 2013. She is currently with National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Medicine, Shenzhen University, P. R. China. Her current research interests include audio/image signal processing, digital watermarking and encryption, machine learning and pattern recognition.

Ee-Leng Tan received his BEng (1st Class Hons) and Ph.D. degrees in Electrical and Electronic Engineering from Nanyang Technological University in 2003 and 2012, respectively. Currently, he is with NTU as a Research Fellow. His research interests include image/audio processing and real-time digital signal processing.

Siping Chen received his Ph.D. degree in biomedical engineering from Xi'an Jiaotong University in 1987. He is currently a Professor in the Department of Biomedical Engineering, School of Medicine, Shenzhen University, and Director of the National-Regional Key Technology Engineering Laboratory for Medical Ultrasound. His research interests include ultrasound imaging, digital signal processing and pattern recognition.

Dong Ni received his Ph.D. degree in computer science and engineering from the Chinese University of Hong Kong in 2009. He is currently an Associate Professor in the Department of Biomedical Engineering, School of Medicine, Shenzhen University. His research interests mainly include ultrasound image analysis, image guided surgery and pattern recognition.

Tianfu Wang received his Ph.D. degree in biomedical engineering from Sichuan University in 1997. He is currently a Professor in the Department of Biomedical Engineering, School of Medicine, Shenzhen University, and the Associate Chair of the School of Medicine. His research interests include ultrasound image analysis, medical image processing, pattern recognition and medical imaging.