Hierarchically engineering quality-related perceptual features for understanding breast cancer


J. Vis. Commun. Image R. 64 (2019) 102644


Xusheng Wang (corresponding author; e-mail: [email protected]), Xing Chen, Congjun Cao
Xi'an University of Technology, Xi'an, China

This paper has been recommended for acceptance by Zicheng Liu.
https://doi.org/10.1016/j.jvcir.2019.102644
1047-3203/© 2019 Elsevier Inc. All rights reserved.

Article info

Article history: Received 16 June 2019; Revised 8 September 2019; Accepted 9 September 2019; Available online 10 September 2019.

Keywords: Breast cancer; Deep learning; Quality-related; Weakly-supervised; Ranking algorithm

Abstract

Breast cancer is generally acknowledged as the second leading cause of cancer death among women. Therefore, accurately understanding breast cancer from X-ray images is an indispensable technique in medical sciences and image analysis. In this work, we propose a novel perceptual deep architecture that hierarchically learns deep features from large-scale X-ray images, wherein human visual perception is naturally encoded. More specifically, given a rich number of breast cancer images, we first employ the well-known BING objectness measure to identify all possible visually/semantically salient patches. Due to the relatively huge number of BING object patches, a weakly-supervised ranking algorithm is designed to select high quality object patches according to human visual perception. Subsequently, an aggregation scheme is utilized to derive the deep features of the high quality object patches within each breast cancer image. Based on the aggregated deep feature, a multi-class SVM is trained to classify each breast cancer image into multiple levels. Extensive comparative studies and visualization results have demonstrated the effectiveness and efficiency of our proposed deep architecture.

© 2019 Elsevier Inc. All rights reserved.

1. Introduction

Nowadays, breast cancer has become a pervasively diagnosed cancer in women. Each year, it has been reported that over one million women throughout the world are diagnosed with breast cancer and many of them die. Therefore, it is necessary to design an image recognition technique that effectively discovers abnormal breast regions as early as possible. Generally, if the abnormal regions are detected at an early stage, the breast cancer is likely to be benign and the patient will have sufficient time for treatment. Otherwise, the breast cancer may turn out to be malignant, which makes it much more difficult for the patient to be cured. In computer vision and machine intelligence, although remarkable performance has been achieved in object recognition, it remains challenging to fully automatically recognize breast cancer from X-ray images, due to the following three reasons: 1) There are tens of visually/semantically salient object parts (e.g., the normal breast tissue and abnormal parts) within a high-resolution X-ray image. These parts reflect the process by which experienced doctors understand each X-ray image.


Obviously, they are informative for recognizing each breast X-ray image. However, it is difficult to develop a computational model that extracts all these salient object parts and further discovers the sequence of human gaze allocated on them. A key problem is how to design a quality-related metric that accurately selects the abnormal regions in a way that mimics human visual understanding. 2) Owing to the competitive performance of deeply-learned visual features in object representation, it is believed that deep human gaze behavior features are highly descriptive of breast cancer. Noticeably, the sequence of gaze movements and the geometry of the gaze shifting path both characterize the process of humans perceiving each breast cancer image. To the best of our knowledge, existing deep models can only represent each image or its internal regions, while the gaze shifting sequence and the path geometry remain unexplored. Mathematically, jointly incorporating the above two clues into a unified deep visual recognition framework is a challenge. 3) In practice, each breast cancer image is accompanied by a series of image-level attributes (e.g., cancer level, tissue labels, and patient tag). They are highly informative to human visual understanding of breast cancer. Thus, it is valuable to incorporate these weak attributes during human gaze behavior modeling.


Moreover, the inherent correlations of the breast's regional features distributed on a manifold are another attribute that should be encoded for discovering visually/semantically salient object patches. Collaboratively and seamlessly encoding these weak attributes into an efficient and solvable model for gaze behavior learning is a tough challenge.

To handle these problems, we propose a ranking algorithm to model the human gaze shifting sequence (GSS), based on which a deep architecture is engineered to encode the detected GSS. An illustration of our proposed recognition pipeline is displayed in Fig. 1, and a high-level sketch is given at the end of this section. Given massive-scale breast images, in order to extract all the possible abnormal parts, we employ the BING (binarized normed gradients) measure [1] to generate object patches from each image. Generally, hundreds of BING object patches are produced from each breast image, and it is computationally prohibitive to characterize all of them. Thus, a weakly-supervised ranking algorithm is developed to intelligently select a few high quality salient object patches, wherein multiple weak attributes can be optimally integrated. Thereby, we build the GSS by sequentially connecting the top-ranking high quality object patches. Subsequently, a hierarchical aggregation network is designed to calculate the deep representation for each GSS. Based on the deep GSS representation, a multi-class SVM classifier is trained to categorize each breast cancer image into multiple types. In summary, the key contributions of our work can be briefed as follows: 1) we propose to automatically learn human gaze behavior for detecting breast cancer; 2) we propose a novel weakly-supervised ranking algorithm to select qualified object patches for mimicking human gaze allocation; 3) we compile a continuously updated breast cancer image set containing thousands of images, based on which comprehensive experiments are conducted to validate the competitiveness of our approach.
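To make the pipeline concrete before the formal description in Section 3, the following minimal Python sketch outlines its stages; every helper name (extract_bing_patches, ranker, cnn, svm) is a hypothetical placeholder for the corresponding component rather than a released implementation.

import numpy as np

def recognize_breast_image(image, ranker, cnn, svm, top_k=5):
    # 1) BING object proposals: hundreds of candidate patches per image.
    #    extract_bing_patches, ranker, cnn and svm are hypothetical stand-ins
    #    for the components described in Section 3, not a released API.
    patches = extract_bing_patches(image)
    # 2) Weakly-supervised quality ranking of the patches (Section 3.1).
    scores = ranker.score(patches)
    # 3) GSS: connect the top-ranking high quality patches in order of saliency.
    gss = [patches[i] for i in np.argsort(scores)[::-1][:top_k]]
    # 4) Aggregate the patch-level deep features into one GSS feature (Section 3.2).
    gss_feature = np.concatenate([cnn.extract(p) for p in gss])
    # 5) A multi-class SVM predicts the cancer level (Section 3.3).
    return svm.predict(gss_feature.reshape(1, -1))[0]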

2. Related work

Our work is closely related to two research topics in image understanding: deep/semantic image modeling and visual quality assessment. A few representative works in the literature are reviewed as follows.

Multi-layer CNNs make it tractable to establish visual recognition models on million-scale image sets like ImageNet [2]. Krizhevsky et al. [3] trained large-scale CNNs using a portion of ImageNet [4], wherein impressive image classification accuracy was observed. Selective search [5] combines exhaustive search and semantic segmentation in order to improve [1]; it generates a concise set of data-driven and category-free image regions. Girshick et al. [6] proposed the regions with CNN features (R-CNN), where the core technique is an intelligent image region sampling scheme. Moreover, Zhou et al. [7] enhanced CNN-based scene classification by searching

high quality training samples. Wu et al. [8] proposed a preprocessing step to facilitate the deep scene classification model; a pre-trained deep CNN is designed by employing locally distributed and discriminative meta objects. He et al. [9] pioneered ResNet, a residual learning paradigm that facilitates the training of significantly deeper neural networks than the standard ones. Further, in [10], Wu et al. introduced BlockDrop to improve the ResNet [11]; it adaptively activates the deep network layers during inference, based on which the computational consumption is greatly reduced without performance loss.

Many semantic models have been proposed to understand aerial photos, either at image-level or region-level. For image-level annotation, Zhang et al. [12] proposed the so-called graphlets to represent an aerial photo geometrically, based on which a discriminative kernel machine is calculated for aerial photo categorization. In [13], Xia et al. designed a weakly-supervised semantic encoding technique to annotate aerial photos at region-level, where a multi-modal hashing algorithm is adopted to rapidly calculate the image kernel for categorizing aerial photos. Akar et al. [14] leveraged the rotation forest and object-level features for categorizing aerial photos. Sameen et al. [15] designed a deep CNN for classifying high-resolution aerial photos containing urban areas, wherein optical bands, digital surface data, and ground-truth maps are combined collaboratively. Cheng et al. [16] used a pre-trained deep CNN for categorizing high-resolution aerial images, wherein the model is fine-tuned by a domain-specific scenery set. Fu et al. proposed to localize fine-grained aircraft in high-resolution aerial photos, where a multi-class activation mapping system is designed to detect aircraft parts. In [17], Wang et al. formulated a multi-scale end-to-end deep network for visual attention detection inside aerial photographs. Yang et al. [30] formulated a focal loss deep neural network for detecting vehicles from aerial photos, where the skip connection is leveraged. In [15], Costea et al. proposed to geo-localize aerial photos by automatically identifying roads and intersections.

In the past decade, a rich number of deep architectures have been proposed to assess image quality. Karayev et al. [8] showed that even deep features learned using object-level labels (rather than standard quality scores) can still perform competitively. Lu et al. [10] proposed a two-column deep architecture that encodes multi-source inputs both globally and locally. In addition, Lu et al. [9] designed a novel aggregation-based deep network for image quality evaluation; a series of random image patches are integrated during each sub-network training, and are subsequently aggregated using either a statistics or a sorting layer. Tang et al. [14] proposed a novel image quality assessment method that derives a nonlinear kernelized regression algorithm by leveraging a rectifier deep network in a semi-supervised way. The model is pre-trained on a rich number of unlabeled images in an unsupervised manner and then fine-tuned on images labeled with human preference scores. Mai et al. [16] developed a deep neural network that preserves the image global composition, by

Fig. 1. An overview of our designed deep model for breast cancer understanding.


learning deep quality features directly from images with the original aspect ratios/sizes. Kao et al. [17] formulated a multi-task deep model to conduct image style prediction. By leveraging CNNs, a multi-task framework is designed to efficiently exploit the weak supervision at both quality-level and semantics-level. Meanwhile, the correlation between visual recognition and quality assessment is characterized by learning the inter-task relationships. Further, in [18], Ma et al. proposed a layout-aware deep network for image quality assessment. It supports arbitrary-sized images and incorporates part-based image details and image spatial configurations simultaneously.

3. Our proposed approach

3.1. Ranking by weakly-supervised quality metric

Each breast image involves a rich number of fine-grained object parts, which collaboratively determine its cancer level. To successfully extract all these object parts, we employ the well-known BING [19] object proposal generation algorithm. BING exhibits three attractive advantages in producing object patches. First, it consumes extremely low time cost while achieving high object detection accuracy. Second, it produces a low-redundancy set of object patches that is particularly suitable for constructing the GSS to mimic human gaze behavior. Third, it has a nice generalization capacity to unseen object categories, based on which our breast cancer recognition model is robust across different cancer levels.

Weakly-supervised ranking: There are usually hundreds of object patches distributed within a breast image. According to human visual perception theory, only a few visually/semantically salient object patches attract human attention, while the rest are left nearly unprocessed. Based on this, we propose a weakly-supervised ranking which incorporates: 1) the breast images' weak attributes and 2) the object patches' intrinsic distribution on a manifold. 1) Generally speaking, each breast image has multiple weak attributes: patient age, gender, and profession. Given that there are K types of attributes in total, we can build a K-dimensional feature vector to represent such weak attributes. Theoretically, we adopt a real number in [0, 1] to describe each attribute's degree. In the objective function of our ranking algorithm, we have to encode this attribute vector into the various object patches within each breast image. Mathematically, the following objective function is obtained:

C(B, S) = \alpha \|B\|_{2,1} + \beta \sum\nolimits_{i} \phi(y_i, r_i) = \alpha \|B\|_{2,1} + \beta \|B^{T} X - S\|,    (1)

where the U × V transformation matrix B is learned over all the object patches and linearly projects each object patch to its ranking score, x_i denotes the appearance feature vector of each object patch, and r_i represents its ranking score. We denote by M the number of training object patches and by R their dimensionality. Matrix X is constructed by stacking the x_i, and matrix S is constructed from the ranking scores. Furthermore, we set r_c = 1 if the i-th object patch is assigned the c-th ranking level, and zero otherwise. The regularization term \alpha \|B\|_{2,1} ensures that matrix B is sparse in rows. Herein, we adopt the l_{2,1}-norm regularization because it avoids highly redundant features and removes noisy features automatically. In addition, the term \beta \|B^{T} X - S\| is a loss function that controls the ranking accuracy. 2) It is generally accepted that spatially adjacent object patches tend to have shared regions. Therefore, object patches coming from the same breast image are strongly correlated in feature space. Based on the manifold theory,

it is beneficial to exploit the underlying relationships among spatially adjacent object patches during weakly-supervised ranking. In our implementation, we incorporate the manifold geometric structure into our ranking technique in two steps. Step 1: we build a T × T affinity graph from the T object patches extracted from each breast image by the BING operator. In contrast to the traditional kNN graph, which requires O(T^2) time to construct, we only link pairwise object patches that have shared regions. This brings two benefits: 1) capturing the differences between each object patch and its spatially adjacent ones implicitly preserves the global spatial configuration; 2) our graph construction scheme is much more efficient than its competitors, since there is no need to exhaustively compare against all candidate object patches; denoting by P the number of spatially adjacent object patches, the time consumption of building our graph is O(PT), whereas it would take O(PT^2) with the conventional kNN construction. Step 2: based on the constructed graph, we formulate our manifold ranking algorithm as:

\Omega(R) = \sum\nolimits_{j,k} M_{jk} \left\| \tfrac{1}{\sqrt{q_{jj}}}\, r_j - \tfrac{1}{\sqrt{p_{kk}}}\, r_k \right\|^{2} + \mu \|R - T\| = \mathrm{tr}\!\left(R^{T} K R\right) + \mu \|R - T\|,    (2)

where M denotes the Gaussian kernel matrix representing the distance between each object patch and its spatial neighbors, K is the normalized graph Laplacian built upon M, i.e., K = D^{-1/2}(D - M)D^{-1/2} with D the diagonal degree matrix of M, and T is the ranking initialization matrix.
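For concreteness, the following Python sketch builds the sparse affinity matrix M by linking only object patches with shared regions and forms K as a normalized graph Laplacian; the box-overlap test, the Gaussian bandwidth, and the all-pairs loop are illustrative assumptions rather than the paper's exact settings (the paper only enumerates each patch's P spatial neighbours, which gives the O(PT) cost discussed above).

import numpy as np

def overlap(a, b):
    # Intersection area of two boxes given as (x1, y1, x2, y2); illustrative helper.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def build_patch_graph(boxes, feats, sigma=1.0):
    # M: Gaussian-kernel affinities, linking only patches with shared regions.
    # For clarity this sketch tests every pair; restricting the test to each
    # patch's P spatial neighbours recovers the O(PT) construction cost.
    T = len(boxes)
    M = np.zeros((T, T))
    for j in range(T):
        for k in range(j + 1, T):
            if overlap(boxes[j], boxes[k]) > 0:
                w = np.exp(-np.linalg.norm(feats[j] - feats[k]) ** 2 / (2 * sigma ** 2))
                M[j, k] = M[k, j] = w
    d = M.sum(axis=1) + 1e-12                        # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    K = D_inv_sqrt @ (np.diag(d) - M) @ D_inv_sqrt   # normalized graph Laplacian
    return M, K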

By combining the above two optimization functions, we obtain the final objective function of our ranking algorithm:

\min_{R, B}\; \alpha \|B\|_{2,1} + \beta \|B^{T} X - S\| + \mathrm{tr}\!\left(R^{T} K R\right) + \mu \|R - T\|, \quad \mathrm{s.t.}\;\; R^{T} R = I,    (3)

We observe that (3) cannot be solved analytically. In our implementation, we adopt the well-known concave-convex procedure (CCCP) [15] to calculate the two parameters iteratively.

3.2. Ranking-based GSS deep encoding

Based on our proposed weakly-supervised ranking algorithm, we notice that the top-ranking high quality object patches are visually/semantically salient and attract human visual attention, as exemplified in Fig. 2. Particularly, humans will first attend to the most salient object patch, then shift their gaze to the second most salient one, and so on. Based on this observation, we derive the GSS to mimic how humans sequentially perceive multiple breast parts within an image. After building the GSS, supposing there are A object patches along the GSS, we first extract a 128-dimensional ImageNet-CNN feature to deeply characterize each object patch. Subsequently, we aggregate these patch-level deep features into a (128A)-dimensional deep feature to represent the entire GSS.
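A minimal sketch of this aggregation step is given below, assuming the A patch-level 128-dimensional CNN features along the GSS are already available; the plain ordered concatenation shown here stands in for the aggregation network and matches the stated (128A)-dimensional output.

import numpy as np

def aggregate_gss(patch_features):
    # patch_features: list of A arrays, each the 128-d deep feature of one GSS patch,
    # ordered by ranking (most salient first); output is a (128 * A)-d GSS feature.
    feats = [np.asarray(f, dtype=np.float32).ravel() for f in patch_features]
    assert all(f.shape == (128,) for f in feats), "each patch feature is expected to be 128-d"
    return np.concatenate(feats)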


Fig. 2. GSS construction by linking high quality object patches and deeply representing each GSS.

3.3. Multi-class kernel SVM for breast cancer classification

Based on the deep GSS features obtained above, a multi-class SVM is trained to categorize breast cancer into multiple levels. Mathematically, for training breast images coming from the i-th cancer level and the j-th cancer level, we formulate the following binary SVM:

\max_{\omega}\; Z(\omega) = \sum_{i=1}^{M} \omega_i - \frac{1}{2} \sum_{i,j=1}^{M} \omega_i \omega_j a_i a_j\, h(F_i, F_j), \quad \mathrm{s.t.}\;\; 0 \le \omega_i \le H, \;\; \sum_{i=1}^{M} \omega_i a_i = 1,    (4)

where F_i denotes the deep GSS feature extracted from the i-th breast image, a_i denotes the class label of the i-th breast image, H controls the penalty on non-separable breast images (and hence the complexity of the classifier), and M counts the number of breast images from the two classes. After learning the SVM classifier, given a new deep GSS feature F^* from a test breast image, its cancer level is calculated as:

\mathrm{sgn}\!\left( \sum_{i=1}^{M} \omega_i a_i\, h(F_i, F^{*}) + Q \right),    (5)

where the bias Q is calculated as Q = 1 - \sum_{i=1}^{M} \omega_i a_i\, h(F_i^{s}, F^{s}), and F^{s} is a support vector from the j-th class. During SVM classification, given B cancer levels, B(B - 1)/2 binary classifications are conducted, and a voting rule is utilized to make the final decision. Specifically, each binary classification is considered as a voting operation, and the final result is the cancer level receiving the largest number of votes (a minimal sketch of this voting scheme is given after Algorithm 1). Summarizing this section, the pipeline of our proposed deep quality-related breast cancer recognition is briefed in Algorithm 1.

Algorithm 1 (Quality-related Perceptual Features for Breast Cancer Recognition).
Input: Massive-scale labeled breast images, the number of cancer levels, the number of object patches inside each GSS;
Output: Calculated cancer level for a test breast image;
1) Generate the BING-based object patches for each breast image, and then learn the ranking model for GSS construction (Section 3.1);
2) Calculate the deep GSS feature by the aggregation neural network (Section 3.2);
3) Train a multi-class SVM classifier based on (4) for predicting the cancer level of a new breast image (Section 3.3).
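As an illustration of the one-vs-one voting scheme in (4) and (5), the sketch below relies on scikit-learn's SVC, which internally trains B(B - 1)/2 binary classifiers and predicts by majority vote; the RBF kernel and the hyperparameters stand in for the unspecified kernel h(·, ·) and are therefore assumptions.

import numpy as np
from sklearn.svm import SVC

def train_cancer_level_classifier(X, y, C=1.0, gamma='scale'):
    # X: (N, 128*A) aggregated deep GSS features; y: cancer levels in {1, ..., B}.
    # SVC handles multi-class problems by one-vs-one voting over B(B-1)/2 binary SVMs;
    # the RBF kernel is an assumed stand-in for the kernel h in Eq. (4).
    clf = SVC(C=C, kernel='rbf', gamma=gamma, decision_function_shape='ovo')
    clf.fit(X, y)
    return clf

def predict_cancer_level(clf, gss_feature):
    # Each binary classifier casts one vote; the level with the most votes wins.
    return clf.predict(np.asarray(gss_feature).reshape(1, -1))[0]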

4. Experimental results and analysis

In this section, we evaluate the performance of our method. We first introduce our compiled massive-scale breast image set. Then, we compare our method with a set of deep/shallow recognition models. Finally, we analyze the important parameters in our proposed recognition model.

4.1. Datasets description

To the best of our knowledge, there is no publicly available large-scale fully-annotated breast cancer data set in the literature. In this work, we spent tremendous human resources to crawl nearly 2.7 M breast X-ray photos from one hundred hospitals in China. Specifically, it is observable that breast X-ray photos from metropolitan hospitals are typically clearer, more detailed, and have correct white balance. Therefore, a crawler was developed to download and crop massive-scale breast images from the top 100 hospitals throughout our country. Typical resolutions of these X-ray breast photos range between 100 × 100 and 1200 × 1200. In our implementation, we consider a breast X-ray photo whose resolution is smaller than 400 × 400 as low resolution, and one larger than 800 × 800 as high resolution (see the sketch at the end of this subsection). Based on such criteria, we collect 0.79 M high-resolution X-ray breast photos and 1.51 M low-resolution ones. Different from our crawler, which is executed fully automatically, each high-quality breast photo and its low-quality counterparts are handled in a semi-automatic way. Specifically, we record the patient's age, education, and occupation for each crawled X-ray breast photo. A few example photos from our data set are presented in Fig. 3. In our scenario, we recruit 28 master/PhD students (10 female and 18 male) from our university. They worked ten hours every working day to double-check the possibly high quality X-ray breast photos. The double-checking took two weeks. Finally, about 0.79 M high-quality X-ray breast images associated with 1.51 M low-resolution counterparts were obtained. To semantically annotate each breast photo at image-level, we inspected the 20 breast cancer levels and summarized the five most discriminative types. In our X-ray breast photo set, each breast photo is typically associated with one to four semantic tags. The semantic tags of each X-ray breast photo are annotated semi-automatically. First of all, part detectors corresponding to the 10 discriminative breast cancer levels are trained, which are further utilized to label the cancer level of the remaining breast photos. To refine these cancer levels, the same 28 master/PhD students carefully double-check each breast photo. It is worth emphasizing that our million-scale breast photo set is updated continuously. The current version was compiled by the recruited master/PhD students over four months. They will be recruited for one year to continuously collect and refine the breast photo set. The final version of the photo set will contain over two million low/high-quality breast photos, each of which is carefully annotated at image-level.
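As a small illustration of the resolution criteria above, the following sketch tags a crawled photo as low or high resolution; the function name, the input format, and the label used for photos between the two thresholds are hypothetical.

def resolution_tag(width, height):
    # Resolution criteria stated in the text: below 400 x 400 -> low resolution,
    # above 800 x 800 -> high resolution; the in-between label is an assumption.
    if width < 400 and height < 400:
        return 'low'
    if width > 800 and height > 800:
        return 'high'
    return 'medium'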

4.2. Comparative study

In the first place, a comparison is made between our method and a number of shallow models for breast image recognition: 1) walk and tree kernel with fixed length: FLWK [20] and FLTK [20]; 2) feature histogram with multiple resolutions (MRH) [21]; 3) the classic multi-layer pyramid kernel for image matching (SPM) with its three enhanced versions: SPM-LLC, SPM-SC, and SPM-OB; 4) super vector image coding (SV) [22] and supervised image encoding for image representation (SSC) [23]. The inherent parameters of the compared algorithms are determined as follows. We adjust the length of FLWK and FLTK from one to


Fig. 3. Sample breast images from five cancer levels in our compiled data set.

15 until the best recognition accuracy is achieved. All the experimental breast images are processed with a Gaussian kernel (sigma = 1) and 15 grey levels. For the SPM and its enhanced versions, all the training breast images are adopted for calculating ten million SIFT points, wherein the grid size is set to 20 × 20 and the space between pairwise grids is fixed to ten pixels. Afterward, an 800-sized codebook is calculated using the well-known hierarchical clustering. Due to the attractive accuracy of deep learning techniques in the past decade, it is also meaningful to compare our algorithm with multiple deeply-learned recognition baselines, that is, CNN based on SPM (CNN-SPM) [24], CleanNet [25], the deep transferable model [26], as well as three popular semantically-guided generic recognition models [27]. It is observable that, except for [22–24], the C++ codes of the compared deep recognition models are accessible from public websites. In this way, there is no need to implement them and we use the default settings directly. For [12], 256 to 1024 region proposals produced by MCG [22] are chosen first. For [16], the sub deep architectures are all trained on CIFAR-10 [23], and subsequently the controller RNN is trained by PPO, i.e., a global workqueue system produces multiple RNN-controlled subnetworks. For [9], the deep network's inputs are uniformly resized to 448 × 448. The network structure depends on the 16-layer VGGNet [12], wherein the DFL component is incorporated into conv4-3. For [11], the inputs of the multi-layer CNN-RNN are resized to 224 × 224 to facilitate the VGG-f model training. During deep feature learning, the publicly available MatConvNet software [13] and the ImageNet-based deep model are employed, and the learned deep features are categorized using a linear multi-class SVM. Noticeably, all our competitors can only handle a single breast image. In our setting, we combine the high-resolution breast image and its low-resolution counterparts into a new one for visual recognition (i.e., the low-resolution breast images are resized to the same width/height of the high-resolution one). As shown in Tables 1 and 2, we conducted a comparative study on the aforementioned shallow/deep recognition algorithms. Each operation is repeated 10 times and the average recognition

Table 1. Comparative accuracy on different shallow recognition models.

Category          FLWK     FLTK     SPM      SPM-LLC  SPM-SC   SPM-OB   SV       SSC
Cancer level 1    0.8943   0.9021   0.9112   0.8645   0.7866   0.7545   0.7765   0.7889
Cancer level 2    0.7433   0.7632   0.7882   0.8034   0.7032   0.7321   0.7768   0.8432
Cancer level 3    0.5673   0.5873   0.6044   0.7033   0.6732   0.6884   0.7032   0.7987
Cancer level 4    0.6565   0.6743   0.6654   0.7845   0.8732   0.8545   0.8667   0.8767
Cancer level 5    0.7687   0.7843   0.7882   0.7912   0.8011   0.8112   0.8321   0.8556
Cancer level 6    0.8121   0.8332   0.8432   0.8557   0.7832   0.7998   0.8012   0.8231
Cancer level 7    0.7645   0.7832   0.7921   0.8043   0.8032   0.8121   0.8324   0.8546
Cancer level 8    0.7832   0.8094   0.7998   0.8121   0.8213   0.8334   0.8554   0.8678
Cancer level 9    0.8323   0.8554   0.8661   0.8565   0.7643   0.7832   0.7993   0.8032
Cancer level 10   0.7756   0.7912   0.8002   0.8087   0.8832   0.8932   0.8876   0.8954
Cancer level 11   0.6732   0.6833   0.6993   0.7102   0.8765   0.8843   0.8673   0.8765
Cancer level 12   0.7892   0.7733   0.7765   0.7884   0.8321   0.8451   0.8543   0.8654
Cancer level 13   0.6211   0.6321   0.6443   0.6632   0.8432   0.8546   0.8665   0.8732
Cancer level 14   0.8943   0.8911   0.8882   0.8776   0.8843   0.8765   0.8703   0.8893
Cancer level 15   0.7832   0.7992   0.8001   0.8054   0.8993   0.8739   0.8436   0.8678

accuracies are presented. As can be seen, over the 15 breast cancer levels, the best accuracies are consistently achieved by our method.

4.3. Parameter analysis

In total, there are two sets of adjustable parameters that should be evaluated: the three weights α, β, μ, and the number of


Table 2. Comparative accuracy on different deep recognition models.

Category          SPP-CNN  CleanNet DTA      DFB      Mesnil   Xiao     Cong     Ours
Cancer level 1    0.8654   0.8743   0.8954   0.8876   0.8654   0.8876   0.8654   0.9043
Cancer level 2    0.7765   0.7786   0.7954   0.8121   0.7876   0.7879   0.7768   0.8432
Cancer level 3    0.6045   0.5908   0.6121   0.7213   0.6121   0.6054   0.6231   0.7121
Cancer level 4    0.6676   0.6782   0.6787   0.7956   0.6768   0.6879   0.6804   0.8654
Cancer level 5    0.7732   0.7676   0.7987   0.8094   0.7879   0.7768   0.8045   0.8987
Cancer level 6    0.8231   0.8343   0.8565   0.8658   0.8321   0.8436   0.8614   0.9102
Cancer level 7    0.7787   0.7903   0.8102   0.8124   0.7876   0.8093   0.8165   0.8657
Cancer level 8    0.7904   0.8121   0.8043   0.8215   0.8092   0.8232   0.8121   0.8321
Cancer level 9    0.8453   0.8675   0.8768   0.8325   0.8453   0.8786   0.8675   0.8987
Cancer level 10   0.7943   0.7876   0.8121   0.8165   0.8092   0.7906   0.8205   0.8754
Cancer level 11   0.7032   0.6903   0.7005   0.7265   0.7121   0.7084   0.7089   0.8543
Cancer level 12   0.7893   0.7879   0.7964   0.7968   0.7902   0.7987   0.8043   0.8993
Cancer level 13   0.6564   0.6456   0.6546   0.6787   0.6654   0.6567   0.6652   0.7832
Cancer level 14   0.9121   0.8893   0.8987   0.8767   0.9121   0.8905   0.9012   0.9211
Cancer level 15   0.8094   0.8102   0.8212   0.8121   0.8121   0.8143   0.8276   0.9092

The bold values denote the best performer.

object patches inside each GSS, K. In the following, we report the influence of each parameter on our breast cancer recognition performance. Noticeably, before testing each parameter, we use cross validation to set the default values of these parameters, i.e., α = β = μ = 0.1 and K = 5. First of all, we adjust each of α, β, μ from zero to one with a step of 0.05, while the remaining parameters stay unchanged. As shown in Fig. 4, when increasing each of α, β, μ, the recognition accuracy first increases and then peaks; afterward, it decreases to a low level. This observation indicates that maintaining an appropriate value for each of α, β, μ is the optimal choice.
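The parameter sweep described above can be summarized by the following sketch; evaluate_accuracy is a hypothetical helper that runs the full pipeline with the given weights and returns the cross-validated recognition accuracy.

import numpy as np

def tune_weights(evaluate_accuracy, K=5, step=0.05):
    # One-at-a-time sweep of alpha, beta, mu over [0, 1] with step 0.05,
    # keeping the other parameters at their default value 0.1 (Section 4.3).
    defaults = {'alpha': 0.1, 'beta': 0.1, 'mu': 0.1}
    best = {}
    for name in defaults:
        trials = []
        for v in np.arange(0.0, 1.0 + 1e-9, step):
            params = dict(defaults)
            params[name] = v
            trials.append((evaluate_accuracy(K=K, **params), v))
        best[name] = max(trials)[1]   # weight value giving the highest accuracy
    return best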

Fig. 5. Recognition accuracy by varying K.

Afterward, we tune K from one to 10. Similarly, the average recognition accuracy is reported in Fig. 5. As can be seen, the recognition accuracy increases significantly when K is tuned from one to five, and subsequently remains stable. This observation indicates that K = 5 is sufficiently representative for our massive-scale breast image set. Since K determines the number of object patches during GSS construction, we set K = 5 in order to balance the efficiency and effectiveness of our breast image recognition system.

5. Conclusions

This paper proposes a novel perceptual deep architecture that hierarchically learns deep features from X-ray breast images, wherein human visual perception is naturally encoded. We designed a novel GSS construction framework by developing a weakly-supervised ranking algorithm. By calculating a deep representation for each GSS, a multi-class SVM is trained to classify each breast cancer image into multiple levels. Extensive comparative studies on our compiled breast image set have shown the usefulness of our designed recognition system.

Fig. 4. Recognition accuracy by varying α, β, μ.


Declaration of Competing Interest

The authors declared that there is no conflict of interest.

References

[1] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, Philip Torr, BING: binarized normed gradients for objectness estimation at 300fps, Proc. of CVPR, 2014.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, ImageNet: a large-scale hierarchical image database, Proc. of CVPR, 2009.
[3] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet classification with deep convolutional neural networks, Proc. of NIPS, 2012.
[4] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. of CVPR, 2014.
[5] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, Selective search for object recognition, IJCV 104 (2) (2013) 154–171.
[6] Wu Ruobing, Baoyuan Wang, Wenping Wang, Yu Yizhou, Harvesting discriminative meta objects with deep CNN features for scene classification, Proc. of ICCV, 2015.
[7] Aude Oliva, Antonio Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision 42 (3) (2001) 145–175.
[8] Luming Zhang, Yahong Han, Yi Yang, Mingli Song, Shuicheng Yan, Qi Tian, Discovering discriminative graphlets for aerial image categories recognition, IEEE T-IP 22 (12) (2014) 5071–5084.
[9] Yaming Wang, Vlad I. Morariu, Larry S. Davis, Learning a discriminative filter bank within a CNN for fine-grained recognition, Proc. of CVPR, 2018.
[10] Ali Caglayan, Ahmet Burak Can, Exploiting multi-layer features using a CNN-RNN approach for RGB-D object recognition, ECCV Workshops, 2018.
[11] Grégoire Mesnil, Salah Rifai, Antoine Bordes, Xavier Glorot, Yoshua Bengio, Pascal Vincent, Unsupervised learning of semantics of object detections for scene categorizations, Proc. of PRAM, 2015.
[12] Yang Xiao, Wu Jianxin, Junsong Yuan, mCENTRIST: a multi-channel feature generation mechanism for scene categorization, IEEE T-IP 23 (2) (2014) 823–836.
[13] Yang Cong, Ji Liu, Junsong Yuan, Jiebo Luo, Self-supervised online metric learning with low rank constraint for scene categorization, IEEE T-IP 22 (8) (2013) 3179–3191.
[14] Qiang Yang, Yuqiang Chen, Gui-Rong Xue, Wenyuan Dai, Yu Yong, Heterogeneous transfer learning for image clustering via the social web, Proc. of ACL/IJCNLP, 2009.
[15] Chang Wang, Sridhar Mahadevan, Heterogeneous domain adaptation using manifold alignment, Proc. of IJCAI, 2011.
[16] Svetlana Lazebnik, Cordelia Schmid, Jean Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proc. of CVPR, 2006.


[17] Ariadna Quattoni, Antonio Torralba, Recognizing indoor scenes, Proc. of CVPR, 2009.
[18] Hu Yao, Debing Zhang, Zhongming Jin, Deng Cai, Xiaofei He, Active learning based on local representation, Proc. of IJCAI, 2013.
[19] Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical report, University of Toronto, 2009.
[20] Maher Ibrahim Sameen, Biswajeet Pradhan, Omar Saud Aziz, Classification of very high resolution aerial photos using spectral-spatial convolutional neural networks, J. Sensors 7195432 (2009).
[21] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347, 2017.
[22] Pablo Arbelaez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, Jitendra Malik, Multiscale combinatorial grouping, Proc. of CVPR, 2014.
[23] Si Liu, Shuicheng Yan, Tianzhu Zhang, Xu Changsheng, Jing Liu, Lu Hanqing, Weakly supervised graph propagation towards collective image parsing, IEEE T-MM 14 (2) (2012) 361–373.
[24] Jinjun Wang, Jianchao Yang, Yu Kai, Fengjun Lv, Thomas Huang, Yihong Gong, Locality-constrained linear coding for image classification, Proc. of CVPR, 2010.
[25] Li-Jia Li, Su Hao, Eric P. Xing, Li Fei-Fei, Object bank: a high-level image representation for scene classification and semantic feature sparsification, Proc. of NIPS, 2010.
[26] Son N. Tran, Artur S. d'Avila Garcez, Deep logic networks: inserting and extracting knowledge from deep belief networks, IEEE T-NNLS 29 (2) (2018) 246–258.
[27] Chen Wang, Xiao Bai, Shuai Wang, Jun Zhou, Peng Ren, Multiscale visual attention networks for object detection in VHR remote sensing images, IEEE GRST 16 (2) (2019) 310–314.

Xusheng Wang was born in Shanxi, P.R. China, in 1988. He received the PhD degree from University of Paris Sud, France. He now works at Xi'an University of Technology. His research interests include computational intelligence, big data analysis, and integrated circuit design.

Xing Chen was born in Shaanxi, P.R. China, in 1999. She received the Bachelor degree from Xi'an University of Technology, P.R. China. She now works as a college instructor at Xi'an University of Technology. Her research interests include big data analysis and computational intelligence.

Congjun Cao was born in October 1970. She graduated from Northwestern University with a Ph.D. in Computer Software and Theory. She is currently a full professor at Xi'an University of Technology, P.R. China. Her research focuses on cross-media colour reproduction, quality control technology, and computational intelligence.