Classification of white blood cells using capsule networks


PII: S0895-6111(20)30002-1
DOI: https://doi.org/10.1016/j.compmedimag.2020.101699
Reference: CMIG 101699
To appear in: Computerized Medical Imaging and Graphics
Received Date: 20 March 2019
Revised Date: 24 December 2019
Accepted Date: 3 January 2020

Please cite this article as: Yusuf Yargı Baydilli, Ümit Atila, Classification of white blood cells using capsule networks, Computerized Medical Imaging and Graphics (2020), doi: https://doi.org/10.1016/j.compmedimag.2020.101699

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier.

Classification of white blood cells using capsule networks

Yusuf Yargı Baydilli*, Ümit Atila

Department of Computer Engineering, Faculty of Engineering, Karabük University, Karabük, Turkey

Abstract


Background: While the number and structural features of white blood cells (WBC) can provide important information about the health status of human beings, the ratio of the sub-types of these cells and the deformations that can be observed serve as good indicators in the diagnosis of some diseases. Hence, correct identification and classification of the WBC types is of great importance. In addition, the fact that the manual diagnostic process is slow, and that its success is directly proportional to the expert's skills, makes this problem an excellent field of application for computer-aided diagnostic systems. Unfortunately, both ethical concerns and the cost of the image acquisition process are among the biggest obstacles preventing researchers working with medical images from collecting enough data to produce a stable model. For these reasons, researchers who want to perform a successful analysis on small data sets with classical machine learning methods must subject their data to a long and error-prone pre-processing stage, while those using deep learning methods need to increase the data size using augmentation techniques. As a result, there is a need for a model that does not need pre-processing and can perform a successful classification on small data sets.

Methods: WBCs were classified under five categories using a small data set via Capsule Networks, a new deep learning method. We improved the model using many techniques and compared the results with the best-known deep learning methods.

Results: Both of the above-mentioned problems were overcome and higher success rates were obtained compared to other deep learning models. While Convolutional Neural Networks (CNN) and Transfer Learning (TL) models suffered from over-fitting, Capsule Networks learned the training data well and achieved a high accuracy on the test data (96.86%).

Conclusion: In this study, we briefly discussed the abilities of Capsule Networks in a case study. We showed that Capsule Networks are a quite successful alternative for deep learning and medical data analysis when the sample size is limited.

Keywords: Medical image analysis, white blood cells (WBC), deep learning, capsule networks, classification

1. Introduction

The blood circulating in the veins of a healthy person amounts to about 7% of body weight. When the structure of the approximately 5 liters of blood we have is examined under a microscope, it is seen to be formed of a fluid called plasma and various cells. By volume, red blood cells (RBC) make up about 45% of these cells and white blood cells (WBC) about 1% [1]. Each cell type found in the blood has a role to play in maintaining human life functions. For example, RBCs are responsible for oxygen transport, and WBCs are an important part of the human immune system. Since blood is of vital importance in human life, it provides very significant information about the health status of a person. Even a single drop of blood contains thousands of organic and inorganic components, which means that the patient's condition can be clearly interpreted by performing various blood tests at little cost. One of these tests is the peripheral blood smear. In this test, blood is spread on a film slide, treated with a specific staining method, and then examined under a microscope.

*Corresponding author.
Email addresses: [email protected] (Yusuf Yargı Baydilli), [email protected] (Ümit Atila)



Since different cell types react differently to the dye, the WBC sub-types can be identified and separated from RBCs and other components [2]. While variabilities observed in the characteristics of WBCs, such as count, ratio and shape, may be a precursor of some diseases (e.g. leukemia), abnormalities observed in the sub-types of these cells may also be symptoms of some diseases (e.g. neutrophilia) [3]. In this respect, it is of great importance to correctly identify the sub-types of WBCs. Mostly, detection and classification of WBCs is performed manually. However, this process is slow and quite dependent on the capabilities of the relevant operator. Thus, a computer-aided identification system is needed. The identification of WBCs is carried out in three stages using image processing and machine learning methods: segmentation, feature extraction and classification. In the segmentation process, the blood cells are dissociated from the background and other cell types. Next, a feature vector is generated using the properties of the cells (color, shape, size, etc.) [4–6]. Finally, the feature vectors are used as input data, and classification is performed by machine learning methods [7–9]. When the literature is examined, it can be observed that various methods have been proposed for both the segmentation and classification stages of WBCs [10, 11]. However, while the accurate preparation of blood samples is difficult enough anyway, trying to obtain a correct segmentation of the resulting images makes the process even more complicated. Moreover, since errors made during segmentation negatively affect classification success, researchers have turned to deep learning methods. Owing to deep learning methods and the improved processing power of computers, more data can be analyzed more quickly. On the other hand, the most important advantage of deep learning compared to classical machine learning is that images can be used directly as input data [12]. Thus, a pixel-based classification can be implemented, bypassing the segmentation and feature extraction stages (Figure 1). Since each pixel of a medical image carries meaningful data, the data loss caused by feature extraction can be prevented [13]. There are also some studies in the literature on the recognition and classification of WBCs by deep learning methods, which provide highly successful results in the analysis of medical images.

Figure 1: a) Classic machine learning vs b) deep learning.

Habibzadeh et al. [14] classified 140 segmented images using Support Vector Machines (SVM), Kernel Principal Component Analysis (K-PCA) + SVM, and Convolutional Neural Networks (CNN). Comparing the success rates obtained at the end of the study, the authors achieved prediction accuracies of 66%, 74% and 85%, respectively. Choi et al. [15] increased 2,174 images to 48,000 samples by data augmentation methods. The authors, who also used some normalization techniques on the data, achieved 97% classification success. Li et al. [16] reduced the size of the data by applying the Principal Component Analysis (PCA) method on two data sets, the first of which has 16,218 and the other 8,004 images, and then performed classification with CNN. Comparing the results of their study with those obtained from SVM, the authors achieved about 30% higher success compared to the classical machine learning method.


Mundhra et al. [17] proposed a model for the segmentation and classification of WBCs. The authors, who performed segmentation using the U-Net [18] architecture, performed classification on 17,340 images and reached success rates of up to 99%. Shahin et al. [19] carried out an analysis of 2,551 images using the transfer learning method. The researchers applied a pre-processing procedure to their data, performed feature extraction with a pre-trained model, and then made the classification with SVM. At the end of the study they achieved a success rate of 96%. Yu et al. [20] classified WBCs using CNN with 2,000 images. The authors, who compared their study with classical methods, obtained a success rate of 88.5%. Zhao et al. [21] proposed a hybrid model for detecting and classifying WBCs. In the model, first, granular structured WBC sub-types (eosinophil, basophil) were classified by SVM; then feature extraction was performed using CNN; and finally, the other sub-types were classified by Random Forest (RF). The authors, who passed their images through a pre-processing stage, obtained the highest success rate for basophils (100%) and the lowest for eosinophils (70%). Habibzadeh et al. [22] achieved success rates of up to 100% by applying the fine-tuning method, which uses pre-trained models, on 11,200 samples. In another study, Jiang et al. [23] performed classification on 102,000 images, to which data augmentation and PCA had been applied, using a 33-layer CNN model and obtained 83% success. Liang et al. [24] used a hybrid model combining CNN and Recurrent Neural Networks (RNN) and classified 12,444 images. On the CNN side, the transfer learning method was used, and on the RNN side, the authors performed training using the Long Short-Term Memory (LSTM) method. At the end of the study, a higher success rate (90.79%) was achieved compared to the plain CNN method. Qin et al. [25] attempted to classify WBCs using the Deep Residual Network (DRN) method. They increased the amount of data belonging to the different classes with a data augmentation procedure in order to deal with the "imbalanced class" problem, and achieved a success rate of 76.84%. Thanh et al. [26] raised the amount of data, which consisted of 108 original images, to 1,188 samples. At the end of the study, the classification success of their CNN model was measured as 96.6%. Tiwari et al. [27] reported that they achieved higher prediction rates than the SVM method on 13,000 images by performing a classification task according to the nucleus type and sub-types of WBCs. Vogado et al. [28] proposed a model to detect malignant WBCs for leukemia diagnosis. The researchers performed feature extraction using the transfer learning method, reduced the size of the feature vectors they obtained with PCA, and finally classified them with SVM. They achieved 100% success in their work. Although it is possible to observe high success rates with deep learning methods, these methods also have some handicaps. The biggest one is the high dependence on the amount of data [13]. It is possible to increase the data size considerably by using data augmentation methods, but it is known that in this way learning can be raised only up to a specific point, and higher success rates can be achieved only with more unique data [29].
Thus, in order to overcome this obstacle, some authors used transfer learning or fine-tuning methods [19, 20, 22, 28], while others [19, 21, 24, 28] preferred hybrid models in their studies. On the other hand, although deep learning methods can work with raw data by their nature, some authors have tried to increase the success rate by pre-processing the data [14–17, 21–23, 25, 26]. Further, in the case of imbalanced data sets, some studies did not include a few classes in the classification operation, or all classes were examined under two categories [24, 27]. So, there is a need for a model that both does not require a pre-processing procedure and can handle the "insufficient data" problem frequently encountered in medical data sets. Capsule Networks emerge as a model that can overcome the above-mentioned problems. So, in this study, a model that both takes advantage of deep learning and provides a robust classification on small data sets was proposed using Capsule Networks. The rest of the work continues as follows: in the second section, Capsule Networks and their contribution to deep learning are explained; in the third section, the data set and the proposed model are introduced; in the next section, the obtained results are discussed and compared with other deep learning methods; finally, in the last section, the importance of the findings in terms of medical data analysis and machine learning is presented.

2. Deep Learning and Capsule Networks

Machine learning can be defined as the transfer of some human skills to computers. In this way, machines too can learn like humans and interpret new data by using the information they have learned.


For example, artificial neural networks (ANN) were obtained by examining the relationships among neurons in the human brain and mimicking the biological nervous structure. Similarly, the working principles of the deep learning methods used in computer vision problems are designed to mimic the processing of images in the visual cortex of the human and animal brain. In a study by M. Riesenhuber and T. Poggio in 1999, the HMAX model was used to explain how the human brain analyzes images [30]. As seen in Figure 2(a), basic features are extracted from the image perceived by the receptive field in the first layers of the visual cortex. These features are then transferred to the next layers, which focus on more complex features. Similar properties between layers are summarized via maximum pooling-like operations and complexity is increased. Thus, the image is represented hierarchically in the brain and the class of the image is determined. However, this system, which works perfectly while the image is processed in the brain, presents some problems when it comes to modeling it on computers.

Figure 2: a) Sketch of HMAX model, b) Hierarchical model of visual system.


The maximum pooling used in CNN ensures that the spatial information of objects can be transferred to a degree, but it is insufficient to maintain translational invariance. For example, if a window stands where a door should be and/or a door exists where a chimney should be, it would not make sense to describe the structure as a house. But a CNN will classify it as a house, because it only questions whether a feature exists; it is not interested in the positions and hierarchical relations of the features. On the other hand, since a CNN stores the detected features as neuronal activities when trying to recognize objects, different variations of the images should be included in the training set. Otherwise, it will be inadequate at correctly identifying images of the same object taken from different angles ("viewpoint invariance"). Furthermore, halving the number of training parameters via maximum pooling means that some of the important features are lost. Because each pixel of a medical image carries significant information, the effect of this loss is much higher there. In a study published in 1979, D. H. Hubel and T. N. Wiesel highlighted a different attribute of the brain: hyper columns. In the visual cortex of the brain, there are micro columns, each containing approximately 80–120 neurons. These columns use the same receptive field and provide various inferences from the image sensed by the brain. The micro columns perform the same type of extractions, whereas the hyper columns (Figure 2(b)), which are formed by the combination of many micro columns, contain all the information acquired from the image [31, 32]. Capsule Networks try to represent the properties of the image in a hierarchical order by mimicking this ability of the visual cortex in order to overcome the problems CNN has. To this end, they store the spatial relations of objects and object parts in pose matrices (translation and transformation matrices), and then try to recreate the object by using this information. Thus, even if properties of the object such as its angle and position change, the object can be correctly defined (inverse graphics) [33].

Although G. Hinton had had ideas about creating such a model since 1981 [34], he was only able to test these ideas on real-world problems in 2011 [35]. The reason is that Capsule Networks have a high number of training parameters, which requires machines with high computational power.

2.1. Structure of Capsule Networks

Capsule networks basically consist of two sections: encoder and decoder. The decoder section is used to reconstruct the image, while the encoder section is responsible for extracting features and classifying the image.


2.1.1. Encoder

In this section, the properties of the images are extracted and converted into vectors containing "instantiation parameters". Training is then carried out using these vectors, and the class of the test image is estimated. The encoder architecture used by Sabour et al. [36] in their 2017 study analyzing the MNIST [37] data set is shown in Figure 3.


Figure 3: Encoder section of CapsNet (Source: [36]).


The encoder section of Capsule Networks consists of three main parts: the convolution layer, PrimaryCaps and DigitCaps. In the convolution layer, feature extraction is performed on the image through convolution filters. The image with 28x28 resolution is scanned by 9x9 filters (padding = valid, strides = 1), and a three-dimensional 20x20x256 feature map is obtained, with a depth equal to the number of filters used (256). Then, non-linearity is provided through the ReLU function, which activates the values on the feature map. In the next step, the activated feature map is again scanned by 256 filters of size 9x9 (padding = valid, strides = 2). Here, the aim is to obtain more inferences from the feature map without increasing the number of training parameters too much, by applying the 1x1 convolution method [38]. At the end of the convolution process, a feature map with dimensions 6x6x256 is obtained. The PrimaryCaps section is the part where the feature maps obtained after the convolution processes are reshaped and converted into vectors. In this section, the feature map with 256 layers is divided to have a depth of 8 and transformed into 32 primary capsules, each of size 6x6x8 (hyper columns). Each primary capsule consists of 6x6 = 36 low-level capsules (micro columns). As a result, 32 primary capsules are obtained, formed of 36x32 = 1152 8-dimensional low-level capsules. As is known, after the extraction process with convolution layers, an activation function is applied to give the output values non-linearity. For the normalization of capsules, which do not have a neuron structure like ANN and CNN, G. Hinton proposed a new activation function. In this section of the model, a function called "squash" is used so that the 8D vectors (or capsules) containing data from different layers can express a value on the scale [0, 1] (Equation 1):

\mathrm{squash}(u) = \frac{\|u\|^2}{1 + \|u\|^2} \frac{u}{\|u\|}    (1)

In the right part of this function, the unit vector (length = 1) of the capsule is calculated, while in the left part, the extra squashing is applied. If the length of the vector is large, the calculated value approaches '1'; if it is small, it approaches '0'. This value is also used to indicate the probability that the capsule has detected its property, so even if the direction of the vector changes, the probability value remains the same. Thus, a definition is provided that is not affected by neuronal and spatial changes.
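For concreteness, the squashing non-linearity takes only a few lines of code. The snippet below is a minimal NumPy sketch of Equation 1 (the small eps constant is our addition for numerical safety; it is not part of the original formula):

```python
import numpy as np

def squash(u, eps=1e-8):
    # ||u||^2 / (1 + ||u||^2): a length factor in [0, 1)
    norm = np.linalg.norm(u, axis=-1, keepdims=True)
    scale = norm ** 2 / (1.0 + norm ** 2)
    # u / ||u||: the unit vector that preserves the capsule's direction
    return scale * u / (norm + eps)

# A long capsule vector keeps its direction while its length approaches 1:
v = squash(np.array([3.0, 4.0]))
print(np.linalg.norm(v))  # ~0.96
```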

On the other hand, the PrimaryCaps layer is also the section where the capsules holding 8 scalar quantities are converted into vectors that carry directional information. For this purpose, weight matrices are used which enable the object to be expressed in the coordinate plane and which provide an affine transformation (Equation 2). Thus, the spatial relationships between the low- and high-level capsules are extracted. Since the values of this matrix are updated with backpropagation, at the end of the training it becomes possible, by using the low-level capsules that keep the information between the object's parts, to correctly describe the position, direction, etc. of the whole object represented in the high-level capsule. While the number of rows of the matrix used in this process refers to the size of the low-level capsule, the number of columns refers to the size of the vector generated as output. The depth of the architecture and the size of the output vectors will also increase in proportion to the size of the matrix used; this means that more features can be interpreted. Sabour et al. defined the weight matrix as 8x16 in order to create 16D vectors:

\hat{u}_{j|i} = W_{ij} u_i    (2)

The 11,520 predictions produced by the 1,152 capsules for 10 classes will not all be meaningful for each class. Therefore, a procedure called the "dynamic routing algorithm" is used to determine which low-level capsule should be connected to which high-level capsule during the feed-forward pass of Capsule Networks (Algorithm 1) [36].

Algorithm 1 Dynamic routing algorithm
1: procedure ROUTING(\hat{u}_{j|i}, r, l)
2:   for all capsule i in layer l and capsule j in layer (l + 1): b_{ij} <- 0
3:   for r iterations do
4:     for all capsule i in layer l: c_i <- softmax(b_i)
5:     for all capsule j in layer (l + 1): s_j <- sum_i c_{ij} \hat{u}_{j|i}
6:     for all capsule j in layer (l + 1): v_j <- squash(s_j)
7:     for all capsule i in layer l and capsule j in layer (l + 1): b_{ij} <- b_{ij} + \hat{u}_{j|i} . v_j
8:   return v_j

This procedure runs over all capsules and prediction outputs (\hat{u}) in the low-level layer (l) for r iterations and yields the output values of the high-level capsules. The initial value of b_{ij}, a temporary variable, is '0', and it is updated in each iteration. When the procedure is complete, the resulting coupling coefficients are stored in the C matrix. The iteration count in line 3 specifies how many times the dynamic routing algorithm will run. Sabour et al. set this number to 3 in their study and stated that over-increasing this value can lead to over-fitting problems [36]. In line 4, the c_{ij} values, which indicate both the similarity of each low-level capsule to the high-level capsules and the connection probability, are calculated. The "softmax" function is used to calculate the c_{ij} values, ensuring that the 10 estimates produced by each low-level capsule sum to '1' and express a probability (Equation 3):

c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}    (3)
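The following NumPy sketch illustrates how Algorithm 1 can be implemented, reusing the squash function sketched above. The capsule counts in the comments follow the MNIST configuration described earlier; the vectorized shapes are our own assumption rather than the authors' code:

```python
import numpy as np

def dynamic_routing(u_hat, r=3):
    """Routing-by-agreement over prediction vectors.

    u_hat: predictions of the low-level capsules for every high-level
           capsule, shape (n_low, n_high, dim), e.g. (1152, 10, 16).
    r:     number of routing iterations (3 in the original paper [36]).
    """
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                  # routing logits, start at 0
    for _ in range(r):
        # Equation 3: each low-level capsule distributes its output
        # as a probability over the high-level capsules.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Weighted sum of the predictions (s_j), then squash (Equation 4).
        s = (c[:, :, None] * u_hat).sum(axis=0)    # shape (n_high, dim)
        v = squash(s)
        # Agreement: dot product between each prediction and the output.
        b += (u_hat * v[None, :, :]).sum(axis=-1)
    return v                                       # high-level capsule outputs
```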

In the next two steps, as in the calculation of the output values of classical neurons, the outputs of the low-level capsules connected to each high-level capsule are multiplied by the weights c_{ij} and summed. Subsequently, the resulting vector is normalized with the squash activation function (Equation 4):

v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}    (4)

The last step is the section where the outputs of the low-level and high-level capsules are evaluated and the corresponding updates are performed. According to the result of the dot product, if there is a positive agreement between the two vectors, the b_{ij} value of the related low-level capsule is increased; if there is a negative agreement (contrast), it is reduced (coupling effect). Thus, through the updates made over r iterations, coefficients are obtained that ideally define the similarity and matching between the low- and high-level capsules (agreement).

On the other hand, through the dynamic routing algorithm, high-level capsules receive cleaner input signals, since the outputs of the low-level capsules are sent only to the high-level capsule with similar characteristics (the coefficients of those without similar characteristics approach '0'). Thus, the hierarchical representations of the parts that make up the objects can be examined more clearly, and more accurate predictions about the objects can be made. Besides, the DigitCaps layer is the section that contains the 10 (number of classes) high-level capsules (vectors), each of 16D. Using the predictions produced by these vectors, the total error generated by the network over all outputs is calculated through the loss function. Owing to this error value, the weight values (transformation matrices) are updated according to the optimization function used during the backpropagation stage of the network. At the end of the encoder section of the network, Equation 5 is used to calculate the loss value for each output:

L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2    (5)

where T_c = 1 for the correct class, T_c = 0 for the others, m^+ = 0.9, m^- = 0.1 and λ = 0.5. The target vector used for each instance during training, which specifies the correct class, is a 10D "one-hot encoding" vector (1 for the correct class and 0 for the other classes). When determining the loss value for each image, the values obtained from the left side of the function are included in the calculation for the correct label (and the values from the right side for the other classes). If the sample is classified with a confidence of 90% or above, the loss value will be zero. Likewise, if the image's score for a wrong class is 10% or less, the loss value for that class will also be zero. While λ in the formula is used to provide numerical stability, L2 normalization is achieved by the squares. In summary, at the end of the training process, the aim is to obtain a model with at least 90% classification confidence on the training data [36].
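A direct transcription of Equation 5, written here as a NumPy sketch for a single sample, shows how the two margins interact:

```python
import numpy as np

def margin_loss(v_len, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v_len: lengths of the high-level capsules, shape (n_classes,)
    # T:     one-hot target vector, shape (n_classes,)
    left = T * np.maximum(0.0, m_pos - v_len) ** 2                 # correct class
    right = lam * (1.0 - T) * np.maximum(0.0, v_len - m_neg) ** 2  # other classes
    return (left + right).sum()

# Correct class above 0.9 and wrong classes below 0.1 give zero loss:
print(margin_loss(np.array([0.95, 0.05, 0.02]),
                  np.array([1.0, 0.0, 0.0])))  # 0.0
```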


2.1.2. Decoder

The decoder section of the network does not directly contribute to the classification. The authors aimed to provide a regularization effect on the classification by using this section, thus creating a more stable model. In this section, which operates like a variational auto-encoder, the network tries to reconstruct the 28x28 MNIST input using only the DigitCap vector of the correct class obtained after training (masking the others). The major contribution of this section to the classification process is that, since the calculated error values are included in the total loss value, each high-level capsule is forced to extract the maximum inferences from the input data. In this way, the variables that distinguish the classes can be learned.

Figure 4: Decoder section of CapsNet (Source: [36]).

In the decoder section used in the original article (Figure 4), the 16D vector output is fully connected to 512 neurons, then to 1024 neurons. ReLU was chosen as the activation function in these two layers. In the final layer, the 1024 outputs obtained from the previous layer are fully connected to a number of neurons equal to the pixel count of the image to be reconstructed (28x28 = 784). However, unlike the other layers, the sigmoid activation function is used here in order to ensure that each output expresses a pixel value on the scale [0, 1]. Once the resulting output values have been reshaped, a gray-scale image at 28x28 resolution is obtained. The mean squared error calculated between this image and the input image is expressed as the reconstruction loss value.
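The decoder described above maps 16 values back to 784 pixels; a possible Keras transcription is given below (the framework choice is ours, not the authors'):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: the 16D DigitCap vector of the target class (all others masked out).
decoder = keras.Sequential([
    layers.Input(shape=(16,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(1024, activation="relu"),
    # Sigmoid keeps every output in [0, 1] so it can express a pixel value.
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28, 1)),
])
decoder.compile(optimizer="adam", loss="mse")  # reconstruction loss = MSE
```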

Finally, the total loss generated by the whole network for an image is calculated by summing the losses obtained from the encoder and decoder sections (Equation 6):

\mathrm{Loss} = \mathrm{encoder\ loss} + \alpha \cdot \mathrm{decoder\ loss}    (6)

By choosing the value α = 0.0005 in the formula, the intention is to prevent the reconstruction loss from dominating the encoder loss. Thus, the regularization effect is maintained [36].

3. Data set and Proposed Architecture


There are 4,000-11,000 WBCs in 1 microliter of blood. Approximately 40%-70% of them are neutrophils, 20%-45% lymphocytes, 2%-10% monocytes, 1%-6% eosinophils and <1% basophils [2]. The high variability of the sub-types of WBCs in a blood sample is one of the factors that make it difficult to perform a robust analysis with machine learning, because the number of samples belonging to each class in the data set to be classified should be close to each other. Otherwise, the "imbalanced data set problem" occurs and an accurate classification model covering all classes cannot be obtained [39]. In addition, it should be noted that, in order to prevent over-fitting, the data used in the training process should be unique enough to facilitate the distinction between classes and to reflect the specific characteristics of the classes. Considering the details mentioned above, the LISC [4] data set was chosen for analysis. From its 720x576 images belonging to different classes, 128x128 slices were saved such that only one WBC appears in each sample. Some instances and characteristics of the data set are given in Table 1. The resulting 263 samples were separated into 85% training (223 samples) and 15% test (40 samples) sets. Before training, they were normalized by dividing the pixel values in each channel of the images by 255; thus, the network is intended to converge faster [40]. On the other hand, target vectors indicating the correct classes of the images were created using the "one-hot encoding" method. Then, the model was built and passed to the training stage.
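The preparation steps described above amount to only a few lines of code. The sketch below assumes the 128x128x3 crops are already loaded into the hypothetical arrays `images` and `labels` (the fixed seed is also our assumption):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# images: (263, 128, 128, 3) uint8 crops; labels: integer class ids 0..4.
x = images.astype("float32") / 255.0        # normalize each channel to [0, 1]
y = to_categorical(labels, num_classes=5)   # "one-hot encoding" target vectors

# Random 85% / 15% split -> 223 training and 40 test samples.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.15, random_state=42)
```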

Table 1: Properties of data set.

WBC Sub-type     Basophil   Eosinophil   Lymphocyte   Monocyte   Neutrophil
Resolution       128x128 (all classes)
# of Channels    3 (all classes)
# of Samples     58         42           59           48         56
Training Set     49         36           50           43         45
Test Set         9          6            9            5          11

3.1. WBCaps

While constructing the model, we optimized the hyper-parameters using the method called "babysitting" [41, 42]. In this method, we tested the model for 10 epochs while employing different combinations of hyper-parameters. In this way, we had the chance to examine how the model evolved over time and to observe the negative and positive effects of different hyper-parameters on the model. While deciding on some hyper-parameters, we also investigated many papers, including the original study, and developed our model with some techniques commonly used and accepted in the deep learning community. All chosen hyper-parameters, with the reasons for them, are briefly mentioned in the rest of this section. Although Capsule Networks have many advantages over CNN, their most important disadvantage is that the number of parameters to be trained is quite high. Even though the size of the images used in the original article is quite small and the architecture is shallow, there are ~8.2M training parameters. Approximately ~5M of these parameters come from the PrimaryCaps layer, where the reshaping and dynamic routing operations are executed. Therefore, larger images processed in this layer mean a greater number of parameters to be trained. Both Sabour et al. [36], who examined the abilities of Capsule Networks on the smallNORB [43] data set in the second part of their work, and Afshar et al. [44], who classified brain tumor types, reduced the size of the images in their data sets to 32x32 in order to overcome this issue.


However, as mentioned before, each pixel of a medical image can carry meaningful data; reducing the size of the images is therefore likely to cause the loss of important pixels. Thus, to reduce the size of the 128x128 input, one more convolution layer was added in front of the model we built, and stride = 2 was chosen for the filters located in the first two layers. In this way, we were able to reduce the number of training parameters while avoiding data loss. Besides, since we have a small data set, the aim was for the model to work with the "finger prints" of the images (metaphorically), which characterize each sample and facilitate the analysis by making the distinction of images easier (Figure 5). In the end, the size of the data before the PrimaryCaps layer was reduced to the 32x32 level. In addition, by setting padding = same for all convolution filters, the receptive field was increased, with the aim of obtaining maximum extraction from the pixels on the borders of the samples.


Figure 5: Proposed architecture. The biggest change compared to the original model is the placement of another convolution layer (stride = 2) at the front of the model to allow more extractions from the images without further increasing the number of training parameters. Thus, we avoided the data loss that max-pooling or down-scaling the samples would cause.


On the other hand, due to the low number of training samples, the number of convolution filters used in our model was kept low in order to extract more significant features and to prevent non-activated convolution layers. In addition, in order to prevent the "dying neuron" phenomenon, unlike the original article, the PReLU [45] activation function was used in both the convolution and fully-connected layers. Since the depth of the last convolution layer was 128, the number of PrimaryCaps maps passed into the dynamic routing procedure was 128/8 = 16 when the 8D vectors were formed. The number of low-level capsules in the PrimaryCaps layer was 32x32x16 = 16,384. Because the classification of WBCs is a more complicated task than character recognition, the number of low-level capsules was kept this high in order to identify more specific features. The high-level capsule layer (WBCaps), in which the classification is performed, consists of as many 16D capsules as the number of classes (5). The decoder section of the network is formed of fully-connected layers with 64 and 256 neurons, respectively. Finally, the last layer, which provides the reconstruction of the image given as input data, was composed of 128x128x3 = 49,152 neurons. One of the important hyper-parameters used in the training process is the optimization algorithm that computes the updates to be performed during backpropagation. Here, the ADAM (Adaptive Moment Estimation) [46] function was preferred, which tries to prevent the "local minimum problem" by providing "adaptive" updates according to the parameters of the network. Another important detail of our model is the use of "k-fold cross validation". By selecting k = 10, the model was trained with 200 of the 223 images, and the remaining 23 were used for validation. The training and validation sets were then shuffled in each epoch; in this way, the aim was to avoid over-fitting the training data and to produce a stable result [47]. The architecture, which has a total of ~23.5M training parameters, was trained for 50 epochs.
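The front end of the proposed encoder can be sketched as follows. Kernel sizes and filter counts other than the final 128 are our assumptions (the text only states that the filter counts were "kept low"), so this is an illustration of the shape bookkeeping rather than the exact model:

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(128, 128, 3))

# Extra front convolution (stride = 2) added relative to the original design;
# together with the stride-2 second layer it brings 128x128 down to 32x32
# without max-pooling. padding="same" keeps border pixels in play.
x = layers.Conv2D(32, 9, strides=2, padding="same")(inp)   # -> 64x64x32
x = layers.PReLU(shared_axes=[1, 2])(x)
x = layers.Conv2D(64, 9, strides=2, padding="same")(x)     # -> 32x32x64
x = layers.PReLU(shared_axes=[1, 2])(x)
x = layers.Conv2D(128, 9, strides=1, padding="same")(x)    # -> 32x32x128
x = layers.PReLU(shared_axes=[1, 2])(x)

# PrimaryCaps: 128 channels reshaped into 16 maps of 8D capsules,
# i.e. 32*32*16 = 16,384 low-level capsules.
primary = layers.Reshape((32 * 32 * 16, 8))(x)
# squash + dynamic routing (see Section 2.1.1) then map these capsules to
# WBCaps: 5 high-level capsules of 16 dimensions each.
```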

4. Results and Discussion

4.1. WBCaps Results

Figure 6: Training and validation curves.

The graph showing the calculated total loss, encoder loss, decoder loss, training accuracy and validation accuracy for each epoch of the network can be seen in Figure 6. As mentioned earlier, the section of the network responsible for classification plays the largest role in the total success of the model, while the loss value calculated in the decoder section clearly has a regularization effect. On the other hand, it was observed that the network converged at the 27th epoch and achieved its highest validation success (22/23 = 0.9565). After training was completed, the classification performance on the test data, which the model had never seen before, was measured as 37/40 = 0.9250.

Figure 7: a) Reconstructed test data. The yellow-green appearance of the images is due to the variable-type transformations in the normalization process. However, since all images are exposed to these transformations at the same rate, it cannot be said that there is data loss in the color channels, nor can inferences be drawn about the different cells' responses to the staining process used during the peripheral smear. b) Class map; three instances were misclassified.

Another advantage provided by Capsule Networks is that the reconstructed images can be visually inspected, allowing an analysis of how close they are to the original images and how successful the training was. When the images were examined, it was seen that the reconstructed images did not get very close to the original test data because of the insufficient amount of training data (Figure 7). The fact that this task is more complex than the character recognition performed in the original work also led to visually deficient images. However, the most important detail to consider here is that the samples belonging to the same classes among the 40 tested images were reconstructed similarly, as seen in Figure 7(a). When the class map in Figure 7(b) was investigated, it was seen that the reconstructed versions of original images belonging to the same classes were identical except for three samples. Moreover, the reconstructed images provided us an opportunity to observe which properties of WBCs the capsules focused on. For example, the basophil, which visually has no cytoplasm, was depicted as a large, dark, filled circle, whereas the lymphocyte, which is smaller than the basophil and has a small amount of cytoplasm, was rendered as a small filled circle with a clear orbit around it. The model described both the neutrophil and the monocyte with a clear background around them, but it depicted the neutrophil as circular and the monocyte as irregular.

Finally, the eosinophil, whose cytoplasm is smaller than that of the monocyte and neutrophil but larger than that of the basophil and lymphocyte, was formed as a slightly light circle on a dark background. In light of this information, it can be concluded that the capsules focus on the amount of cytoplasm and on the shape characteristics of WBCs. In brief, although the reconstruction process fails in a visual sense, it can be said that during training the encoder section was able to learn the distinctive features of the different classes correctly.

Table 2: Confusion matrix and misclassified WBC samples (summarized; in the original table the three misclassified samples are shown as images).

True Label     Test Samples   Correctly Classified   Misclassified
Basophil       9              8                      1
Eosinophil     6              5                      1
Lymphocyte     9              9                      0
Monocyte       5              4                      1
Neutrophil     11             11                     0

In the next stage of our study, a confusion matrix was created to examine the effect of correctly and incorrectly classified test data on the overall success of the model (Table 2). At the same time, the misclassified images were investigated in more detail, and the factors that negatively affected the success of the model were discussed. When the misclassified basophil and eosinophil images are examined, it can be said that they are different and deformed cells compared to the other samples in the data set. This can be interpreted as the main reason why the capsules failed on these samples. For the misclassified monocyte cell, however, a complete failure of the model was observed. Another important factor that may lead to the misclassification of the eosinophil and monocyte samples is that the training data of these two classes are relatively scarcer than those of the other classes. Then, statistical measures such as sensitivity, specificity, precision and accuracy, with which we can evaluate the success of the model in terms of each class as well as its overall success, were calculated by means of this matrix (Table 3).

Table 3: Classification report.

WBC Sub-type   Sensitivity (%)   Specificity (%)   Precision (%)   Accuracy (%)
Basophil       88.89             96.67             88.89           94.87
Eosinophil     83.33             100.00            100.00          97.37
Lymphocyte     100.00            100.00            100.00          100.00
Monocyte       80.00             97.06             80.00           94.87
Neutrophil     100.00            96.30             91.67           97.37
Avg./Total     92.50             98.01             92.50           96.86
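For reference, the four statistics of Table 3 can be derived from any square confusion matrix. The helper below is a generic sketch using the standard definitions (small differences from the reported per-class accuracies may remain, depending on the exact definition used):

```python
import numpy as np

def per_class_metrics(cm):
    """Sensitivity, specificity, precision and accuracy per class.
    cm: square confusion matrix, rows = true labels, cols = predictions."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp      # samples of the class that were missed
    fp = cm.sum(axis=0) - tp      # other classes predicted as this class
    tn = cm.sum() - tp - fn - fp
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),
            "accuracy":    (tp + tn) / cm.sum()}
```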

It would not be wrong to say that the model we created generally achieved a high classification accuracy (96.86%). The low number of test samples for the related classes can be shown as the main reason for the low sensitivities of the eosinophil and monocyte classes and the low precision of the monocyte class: the misclassification of even one sample resulted in a negative effect of up to 20% on the results.

4.2. WBCaps vs Others

In this part of the study, the success rates obtained using Capsule Networks were compared with the results obtained from CNN and the transfer learning method in order to test this hypothesis: "Capsule networks are more successful at classifying small data sets than other deep learning methods."

Figure 8: Choosing a deep learning method according to the similarity and size of the data set.


As mentioned before, the success of deep learning methods increases in proportion to the size of the data set. However, in the deep learning literature, there are some methods to ensure a successful classification when the data set is small. One of them is the transfer learning method. In this method, the classification of the new data set is performed using a model that was previously trained for a different classification task. For this purpose, while the weight values found in the feature extractor layers of the model are preserved, the feed-forward process is carried out with the new data set and feature vectors are obtained. In the final stage, the resulting feature vectors are linked to a softmax or SVM layer that has as many output neurons as the number of classes. Thus, it becomes possible to transfer previously learned properties and classification capabilities obtained from a different data set to the new classification task. Another method is fine-tuning. In this method, while the parameters of some of the layers in the model are frozen, the weights of the others (or of the whole network) are re-trained and updated in each epoch, and the feature extraction of the images is performed. In this way, while the basic features are detected in the first parts of the model, the aim is to create a model whose later layers can focus on the more complex parts of the newly feed-forwarded images. Thus, a system that can be adapted to the new data set is obtained in a faster way [48]. In addition to these methods, there is one more method, called "training from scratch", which means building a new CNN model from zero and performing the training-testing operations with the data set. Which of the transfer learning, fine-tuning and training-from-scratch methods should be used on the data set to be classified can be evaluated according to two criteria (Figure 8). If we have a large data set and the model to be used was previously trained with a similar data set, using the fine-tuning method will not result in over-fitting of the training data and a successful classification can be achieved. If the data set is large but there is no model previously trained with a similar data set, then it may be advisable to perform training by constructing a CNN model from scratch rather than using transfer learning or fine-tuning. On the other hand, in cases where the data set is small and similar to the data set used by the pre-trained model, it is reasonable to use the capabilities of the model in which complex feature extractions are already made; extracting the features of the new data set through the transfer learning method and training only the classifier part at the end of the model can be suggested. However, if the data set is small and does not have the same characteristics as the pre-trained model's data set, then, to achieve success, it makes sense to train the last few layers and the classifier part while freezing the parameters of the first layers. Otherwise, a fine-tuning process with the small data set will result in over-fitting of the training data [49]. Since the data set we used for the classification of WBCs was small, the transfer learning method was chosen for the tests to be performed. Moreover, since there is no similarity between our data set and the data sets used by the pre-trained models, not only the classifier part of the models but also the parameters of the last two layers were included in the training process. At the end of each model, a softmax layer with five outputs was added. A sketch of this setup is given below.
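The hedged Keras sketch below loads one of the four tested backbones with ImageNET weights, freezes all but its last two layers, and attaches a new five-output softmax classifier (the input size and optimizer are our assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# VGG19 backbone pre-trained on ImageNET; the WBC crops would be up-scaled
# to the 224x224 input size the backbone was trained with.
base = keras.applications.VGG19(include_top=False, weights="imagenet",
                                input_shape=(224, 224, 3), pooling="avg")
for layer in base.layers[:-2]:
    layer.trainable = False       # keep the early feature extractors frozen

model = keras.Sequential([
    base,
    layers.Dense(5, activation="softmax"),  # one output per WBC sub-type
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```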
In this way, architectures are obtained that focus on low-level properties such as edges and shapes in the first layers but on the more complex features of WBCs in the following layers. In addition to transfer learning, classification of the data set was performed by creating a model similar to the LeNET [37] architecture. Thus, a bench-marking environment was created for the success rates that CNN and Capsule Networks achieve when training from scratch. The ImageNET [50] data set, which contains images from 1,000 different categories and is therefore capable of generalizing well to images not present in the data set, was preferred for the initial weighting of the models to be used. Besides, it was assumed that maximum inferences could be made on our data set by choosing models that showed high performance on this data set.


Figure 9: Training and validation curves of CNN and transfer learning methods.


Eventually, the models used for testing were determined as Inception-ResNETv2 [51], Inceptionv3 [52], ResNET50 [53] and VGG19 [54]. Another important reason for preferring these models is that they use the "network-in-network" [38] and Deep Residual Network (DRN) methods in addition to CNN; thus, the success rates of these other learning methods on the data set were analyzed as well. All models were trained for 50 epochs and the k-fold cross-validation method was used.

4.3. Discussion


As can be seen in Figure 9, when training from scratch with CNN was carried out, the network tended to over-fit the training data starting from the 20th epoch, as expected due to the lack of training data. On the other hand, since the data were shuffled in each epoch, the obtained validation rates were quite low, and the CNN model reduced the classification success to almost "head-or-tails odds" (69.57%). The success rate of the model on the test samples was measured as 62.50%. Furthermore, in the tests conducted on the pre-trained models, even though the last two layers before the classifier were included in the training, the high variance of the data could not be prevented. Nevertheless, unlike CNN, it can be said that these models learned the complex features of WBCs and made successful estimations on the training data. The highest validation rates seen for the four models were: Inception-ResNETv2, 95.65%; Inceptionv3, 91.30%; ResNET50, 91.30%; and VGG19, 95.65%. Unfortunately, when classification was executed on the test data, the predictions remained behind the validation percentages. The measured precision values were: Inception-ResNETv2, 82.50%; Inceptionv3, 80.00%; ResNET50, 80.00%; and VGG19, 77.50%, respectively. Although transfer learning can achieve success on many classification problems, it was observed that, since the training of these models is performed with macroscopic-world images, their success rates on microscopic data are relatively low. In addition, because these architectures are very deep, over-fitting of the training data could not be prevented, even though the training and validation data were continuously shuffled through cross-validation and the final layers of the models were included in the training. Since the images used for validation are in fact part of the training data, it would be wrong to consider the high validation rates as the results of a stable model. Moreover, one of the cons of the transfer learning method is that these models are fixed architectures. For this reason, the data to be used as input should be the same size as the images used in the previously trained data set.

As the down-scaling of medical images causes the loss of valuable pixels, it is likely that up-scaling will also undermine correct classification by adding noise to the image. Another reason negatively affecting the smooth functioning of transfer learning on medical images is that these images are not always color images (e.g. MR images, ultrasound, etc.). Accordingly, models that were trained on 3-channel images cannot be used directly on black-and-white images. This means pre-processing, hence time loss and the corruption of the originality of the data. The greatest innovations that Capsule Networks bring to deep learning are that they can reach high success with limited data, they are not adversely affected by pose and positional changes, and they are able to analyze the parts of objects and their hierarchies effectively through capsule activations. These features make Capsule Networks advantageous over other deep learning methods. Unfortunately, since the routing process requires a strong calculation unit and is very slow, it is difficult to measure how successful the method is on large data sets.


4.4. Further Research

Although the success of the proposed architecture on the LISC data set is a good start, how the model will behave on other WBC data sets needs to be investigated. Since the proposed architecture was modified to meet the requirements of the LISC data set, a Capsule Network architecture that functions robustly on larger data sets can be proposed. Besides, the abilities of Capsule Networks in domain adaptation and transfer learning are also worth examining.

5. Conclusion


The complex structure of medical images makes their analysis one of the most difficult image processing tasks. Moreover, since medical images do not follow a single standard (different numbers of color channels, resolutions, focus ratios, etc.), various methods have been proposed from past to present by researchers who want to perform segmentation and/or classification. If these methods use classical machine learning with feature vectors obtained after analysis, the images must be subjected to a long pre-processing stage, and the success of the modeling made with the obtained meaningful information is directly proportional to the success of that pre-processing stage. Therefore, researchers continue to put forth studies to find faster, more successful and more general methods. Deep learning has been one of the most popular machine learning methods of the past few years. One of the most important reasons for this popularity is that, through this method, end-to-end training can be carried out, preventing the negative effects caused by pre-processing. However, as with every method, there are some obstacles; one of them is its dependency on data size. Although it is possible to obtain maximum gains from different variations of the existing data by applying data augmentation techniques, if the number of samples is low, over-fitting of the training data cannot be prevented, and the success observed during training will not be obtained on the test data. Therefore, there is a need for a method that allows successful modeling by providing maximum inferences over a small data set and the necessary flexibility over the hyper-parameters throughout the entire training process. Capsule networks can be used as a method to solve such problems that other deep learning methods have. Thanks to the structure of this model, which mimics the human brain's hierarchical processing and representation of images, it enables an accurate analysis that is not adversely affected by changes in exposure and translation. It also has the ability to produce robust models even with small amounts of data. If we discuss the clinical impact of this study, firstly, we can say that Capsule Networks handled 3-channel medical data well. In addition, the fact that Capsule Networks achieved high success rates even with small amounts of data will reduce the need for an expert as much as possible. On the other hand, by adapting the model to new data within the scope of transfer learning, it may even be possible to increase the model's success to higher levels. Since Capsule Networks obtain high prediction scores after an end-to-end training process, the data can be processed without any data-increasing procedures (data augmentation, up-sampling, down-sampling, etc.). Thus, the data and time losses caused by both cost and ethics can be prevented. In this way, it will be possible to employ an "object recognition and classification model" that reduces dependence on expert-provided labeling, thus allowing the process to be much faster. Apart from WBCs, it may be possible to create "fully automated CAD" models using Capsule Networks and achieve high success rates for other medical data types.

In this study, the analysis and classification of the WBC sub-types on peripheral blood smear images, which provide important information about human morphology, were performed with Capsule Networks. In the first phase of the study, WBCs were examined from scratch through an end-to-end training process, with a network developed with some modifications to the original model. Although the study was performed with a small data set, high accuracy rates were achieved (96.86%). In the second phase of the study, the capabilities of Capsule Networks were compared with other deep learning methods; the advantages of Capsule Networks over other methods were discussed and the success rates obtained on the same data set were evaluated. Thus, it has been shown that Capsule Networks can achieve higher success on small data sets than other models, without using any data augmentation technique or pre-processing. If the results of this study are evaluated from a philosophical point of view, the fact that a machine with no previous knowledge about the medical images to be analyzed can reach almost the level of a hematologist on a limited number of images is very important in terms of showing the point that artificial intelligence has reached today.


References


[1] B. Alberts, A. D. Johnson, J. Lewis, D. Morgan, M. Raff, K. Roberts, P. Walter, Molecular Biology of the Cell, 6th Edition, W. W. Norton & Company, New York, NY, US, 2014 (Nov. 2014).
[2] A. Adewoyin, B. Nwogoh, Peripheral blood film - A review, Annals of Ibadan Postgraduate Medicine 12 (2) (2014) 71–79 (Dec. 2014).
[3] V. Kumar, A. K. Abbas, J. C. Aster, Robbins Basic Pathology, 10th Edition, Saunders, Philadelphia, PA, 2012 (Jun. 2012).
[4] S. H. Rezatofighi, H. Soltanian-Zadeh, Automatic recognition of five types of white blood cells in peripheral blood, Computerized Medical Imaging and Graphics 35 (4) (2011) 333–343 (Jun. 2011).
[5] M. Habibzadeh, A. Krzyżak, Comparative study of shape, intensity and texture features and support vector machine for white blood cell classification, Journal of Theoretical and Applied Computer Science 7 (1) (2013) 20–35 (2013).
[6] L. B. Dorini, R. Minetto, N. J. Leite, Semi-automatic white blood cell segmentation based on multiscale analysis, IEEE Journal of Biomedical and Health Informatics 17 (1) (2013) 250–256 (Jan. 2013).
[7] J. Rawat, A. Singh, H. S. Bhadauria, J. Virmani, Computer aided diagnostic system for detection of leukemia using microscopic images, Procedia Computer Science 70 (2015) 748–756 (Jan. 2015).
[8] H. Ramoser, V. Laurain, H. Bischof, R. Ecker, Leukocyte segmentation and classification in blood-smear images, in: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, 2005, pp. 3371–3374 (Jan. 2005).
[9] N. Theera-Umpon, White blood cell segmentation and classification in microscopic bone marrow images, in: L. Wang, Y. Jin (Eds.), Second International Conference on Fuzzy Systems and Knowledge Discovery, Lecture Notes in Computer Science, Springer Berlin Heidelberg, Changsha, China, 2005, pp. 787–796 (2005).
[10] S. N. D. Pergad, S. T. Hamde, Review on feature extraction and classification of WBC in bone marrow for disease diagnosis, IJIREEICE 4 (1) (2016) 68–73 (Jan. 2016).
[11] J. Rawat, H. S. Bhadauria, A. Singh, J. Virmani, Review of leukocyte classification techniques for microscopic blood images, in: 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 2015, pp. 1948–1954 (Mar. 2015).
[12] K. Suzuki, Overview of deep learning in medical imaging, Radiological Physics and Technology 10 (3) (2017) 257–273 (Sep. 2017).
[13] D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (1) (2017) 221–248 (Jun. 2017).
[14] M. Habibzadeh, A. Krzyżak, T. Fevens, White blood cell differential counts using convolutional neural networks for low resolution images, in: 12th International Conference, ICAISC 2013, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, Zakopane, Poland, 2013, pp. 263–274 (Jun. 2013).
[15] J. W. Choi, Y. Ku, B. W. Yoo, J.-A. Kim, D. S. Lee, Y. J. Chai, H.-J. Kong, H. C. Kim, White blood cell differential count of maturation stages in bone marrow smear using dual-stage convolutional neural networks, PLoS ONE 12 (12) (2017) e0189259 (Dec. 2017).
[16] X. Li, W. Li, X. Xu, W. Hu, Cell classification using convolutional neural networks in medical hyperspectral imagery, in: 2nd International Conference on Image, Vision and Computing (ICIVC), IEEE, Chengdu, China, 2017, pp. 501–504 (Jun. 2017).
[17] D. Mundhra, B. Cheluvaraju, J. Rampure, T. Rai Dastidar, Analyzing microscopic images of peripheral blood smear using deep learning, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Vol. 10553 of Lecture Notes in Computer Science, Springer International Publishing, Québec City, QC, Canada, 2017, pp. 178–185 (2017).
[18] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, Springer International Publishing, Munich, Germany, 2015, pp. 234–241 (2015).
[19] A. I. Shahin, Y. Guo, K. M. Amin, A. A. Sharawi, White blood cells identification system based on convolutional deep neural learning networks, Computer Methods and Programs in Biomedicine (Nov. 2017).
[20] W. Yu, J. Chang, C. Yang, L. Zhang, H. Shen, Y. Xia, J. Sha, Automatic classification of leukocytes using deep neural network, in: IEEE 12th International Conference on ASIC (ASICON), IEEE, Guiyang, China, 2017, pp. 1041–1044 (Oct. 2017).
[21] J. Zhao, M. Zhang, Z. Zhou, J. Chu, F. Cao, Automatic detection and classification of leukocytes using convolutional neural networks, Medical & Biological Engineering & Computing 55 (8) (2017) 1287–1301 (Aug. 2017).
[22] M. Habibzadeh, M. Jannesari, Z. Rezaei, H. Baharvand, M. Totonchi, Automatic white blood cell classification using pre-trained deep learning models: ResNet and Inception, in: Tenth International Conference on Machine Vision (ICMV 2017), Vol. 10696, International Society for Optics and Photonics, Vienna, Austria, 2018, p. 1069612 (Apr. 2018).



