Visual features based automated identification of fish species using deep convolutional neural networks


Hafiz Tayyab Rauf a, M. Ikram Ullah Lali a,⁎, Saliha Zahoor a, Syed Zakir Hussain Shah b, Abd Ur Rehman a, Syed Ahmad Chan Bukhari c

a Department of Computer Science, University of Gujrat, Gujrat, Pakistan
b Department of Zoology, University of Gujrat, Gujrat, Pakistan
c Division of Computer Science, Mathematics and Science, College of Professional Studies, St. John's University, New York, United States

ARTICLE INFO

ABSTRACT

Keywords: Fish species classification; VGGNet; Deeply supervised VGGNet

Morphology-based fish species identification is an error-prone and time-consuming process. There are numerous fish species, and due to their close resemblance to one another it is difficult to classify them by external characters. Recently, computer vision and deep learning-based identification of different animal species has been widely used by researchers. The Convolutional Neural Network (CNN) is one of the most analytically powerful tools in deep learning for image classification based on visual features. This work proposes a deep learning framework based on the CNN method for fish species identification. The proposed CNN architecture contains 32 layers, which is considerably deep to derive valuable and discriminating features from the image. Deep supervision is imposed on the VGGNet architecture to increase the classification performance by directly adding four convolutional layers to the training of each level in the network. To test the performance of the proposed 32-layer CNN architecture, we developed a dataset termed Fish-Pak, publicly available at Mendeley Data (Fish-Pak: https://doi.org/10.17632/n3ydw29sbz.3#folder-6b024354-bae3-460aa758-352685ba0e38). Fish-Pak contains 915 images of six distinct classes, Ctenopharyngodon idella (Grass carp), Cyprinus carpio (Common carp), Cirrhinus mrigala (Mori), Labeo rohita (Rohu), Hypophthalmichthys molitrix (Silver carp), and Catla catla (Thala), with three different image views (head region, body shape, and scale). To ensure the superior performance of the proposed CNN architecture, we have carried out an experimental comparison with other deep learning frameworks, involving VGG-16 for transfer learning, one block VGG, two block VGG, three block VGG, LeNet-5, AlexNet, GoogleNet, and ResNet-50, on the Fish-Pak dataset. Comprehensive empirical analyses reveal that the proposed method achieves state-of-the-art performance and outperforms existing methods.

1. Introduction

Fish have been a major source of healthy nutrients in human food throughout the years. Fish are the most diverse group of vertebrates, with more than 33,000 species (Oosting et al., 2019). These species differ largely in taste, flavour and nutritional value. Fish species are identified on the basis of their specimens, and these specimens are characterized by visual features (dos Santos and Gonçalves, 2019; Tseng and Kuo, 2019), including shape, texture, colour and head shape. Correct identification of different species helps in scientific fields including fish medicine, ecology, taxonomy and evolutionary studies (Salman et al., 2019). The unique features that identify and differentiate culturable fish species are at the fingertips of fishermen. However, automated techniques for fish species recognition can improve the results and help in more detailed scientific studies of fish behaviour using fish images and videos under water (Piechaud et al., 2019). Automated fish recognition is mainly based on pattern matching, physical and statistical behaviour, and feature extraction. Fish recognition is also important for fish species counts, population assessment, fish counting, the study of fish associations, and monitoring of ecosystems (Ogunlana et al., 2015; Lu et al., 2019; Olsvik et al., 2019).

⁎ Corresponding author. E-mail addresses: [email protected] (H.T. Rauf), [email protected] (M.I.U. Lali), [email protected] (S. Zahoor), [email protected] (A.U. Rehman), [email protected] (S.A.C. Bukhari).

https://doi.org/10.1016/j.compag.2019.105075 Received 17 July 2019; Received in revised form 10 September 2019; Accepted 27 October 2019 0168-1699/ © 2019 Elsevier B.V. All rights reserved.


Table 1. Morphological features of different fish species determined manually from the Fish-Pak dataset.

Grass carp — Head region: Mouth is terminal to sub-terminal. Head is compressed and slightly pointed. Snout is very short. Lips are non-fleshy and firm. Fin rays: Dorsal fin has 7–8 rays; pectoral fin has 15–20 rays; anal fin has 8–10 rays. Color: Dark olive, shading to brownish-yellow on the sides, with a white belly. Body shape: Elongated, chubby and torpedo-shaped.

Common carp — Head region: Mouth is large and slightly oblique. Snout is long and blunt. Lips are thick, with one pair of barbels on the upper lip. Fin rays: Dorsal fin has 18–22 soft rays; pectoral fin has 14–18 soft rays; pelvic fin has 8 or 9 soft rays; anal fin has 4–6 soft rays; caudal fin has 19 soft rays. Color: Silvery grey with yellowish belly. Body shape: Elongated, laterally compressed, with an arched back.

Mori — Head region: Mouth is inferior. Head is isosceles. Snout is blunt. Lips: upper lip is entire and is not continuous with the lower lip. Fin rays: Dorsal fin has 12–13 rays. Color: Grayish or greenish on the back and silvery at the sides and below. Body shape: Elongated, streamlined or laterally compressed.

Rohu — Head region: Mouth is terminal. Head is equilateral. Snout is depressed and projects beyond the jaws. Lips: lower lip is fringed and folded; the extending upper lip covers the lower lip. Fin rays: Dorsal fin has 12–13 fin rays; pectoral fin has 17 rays; pelvic fin has 9 rays; anal fin has 7 rays; caudal fin has 19 rays. Color: Blackish on the dorsal side and silvery on the ventro-lateral sides. Body shape: Spindle shaped.

Silver carp — Head region: Mouth is wide and slightly superior. Head is large and broad. Snout is short and blunt. Fin rays: Dorsal fin has 8 rays; anal fin has 12 rays; caudal fin has 21–22 rays. Color: Greenish on the back, silvery on the belly. Body shape: Deep and laterally compressed.

Thala — Head region: Mouth is upturned. Head is broad. Snout is bluntly rounded. Lips: upper lip is thin and covered by the skin of the snout; lower lip is moderately thick. Fin rays: Dorsal fin has 18–19 rays; anal fin has 8 rays. Color: Grayish on back and flanks, silvery-white on the underside. Body shape: Short and deep, somewhat laterally compressed.


Fig. 1. Example head images of 6 different fish species taken from a particular position.

Fig. 2. Example body images of 6 different fish species taken from a particular position.

Fig. 3. Example scale images of 6 different fish species taken from a particular position.


Fig. 4. LeNet-5 architecture with vector lengths of 256 and 128, respectively.

Fig. 5. VGG-16 architecture with five consecutive convolution layers.

Fig. 6. Building block of inception model with dimension reduction.

Underwater fish species classification has been performed on large datasets by applying machine learning, Convolutional Neural Networks, deep learning and image processing methods, achieving an accuracy of 96.29% (Rathi et al., 2017). The purpose of this study is to develop a framework to classify fish species on the basis of different visual features. We developed a dataset called Fish-Pak (Shah et al., 2019) that is accessible at Mendeley Data. As a case study, we tested our framework on images of six data classes, i.e. Ctenopharyngodon idella (Grass carp), Cyprinus carpio (Common carp), Cirrhinus mrigala (Mori), Labeo rohita (Rohu), Hypophthalmichthys molitrix (Silver carp), and Catla catla (Thala). Table 1 presents the different features for the classification of these fish species. Fig. 1 shows a subset of head view images taken randomly from Fish-Pak (Shah et al., 2019); similarly, Figs. 2 and 3 contain the complete body view and scale view of 12 different instances. The previous work on fish classification was carried out in different

environments and on different datasets, whereas our work is done on the Fish-Pak dataset, which consists of farmed fish species from tropical areas including Pakistan. To the best of the authors' knowledge, no earlier work has been reported on these species. We have proposed a CNN architecture based on 32 layers for the classification of fish species on the Fish-Pak dataset. The proposed model is a modified version of the classical VGGNet (Simonyan and Zisserman, 2014). The proposed 32-layer CNN architecture is employed to train and test several images of fish species with six different target classes. We examine the introduced model on the Fish-Pak dataset with three separate image views (head region, body shape, and fin rays or scale). Furthermore, we tested and compared the proposed 32-layer CNN architecture with VGG-16 (Simonyan and Zisserman, 2014) for transfer learning, one block VGG, two block VGG, three block VGG, LeNet-5 (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012),


GoogleNet (Szegedy et al., 2015), and ResNet-50 (He et al., 2016). The accuracy achieved in the experimental evaluation proved that the proposed CNN model is far better than the other state-of-the-art models.

2. Literature review

In recent years, animal identification, recognition and classification have attracted several researchers due to their importance in agriculture-based economies and food security (Zhongzhi, 2019). Hu et al. (2012) proposed a novel model for fish species identification based on textural features, colours and multi-class support vector machines (MSVMs). The colour and textural features of the fish skin were captured from the original image, and the LIBSVM software was used to select the best features for accurate identification. As proved by their experiments, the Bior 4.4 wavelet filter in the HSV colour space is preferred for the best fish identification model. Two one-against-one based MSVMs, DAGMSVM and VBMSVM, were created to extract the best features. Their proposed techniques showed better results as compared to the state-of-the-art methods.

Fig. 7. Building block of residual learning.

Table 2. Detailed characteristics of the proposed 32-layer CNN architecture.

| Layer No. | Layer Name | Maps and Neurons | Padding | Kernel Initializer | Kernel Size |
| 0 | Image input | 3@200×200 | Valid | – | – |
| 1 | Convolution | 200@32×32 | Same | U[−√(6/Input), √(6/Input)] | 3×3 |
| 2 | Activation | ReLu | – | – | – |
| 3 | Max Pooling | 200@7×7 | Valid | – | 3×3 |
| 4 | Dropout | (2×i/10)@32×32 | – | – | – |
| 5 | Batch Normalization | – | – | – | – |
| 6 | Noise | Gaussian, rate (2×i−1)/10 | – | – | – |
| 7 | Fully Connected | 32 | Same | Uniform | – |
| 8 | Convolution | 200@64×64 | Same | U[−√(6/Input), √(6/Input)] | 9×9 |
| 9 | Activation | ReLu | – | – | – |
| 10 | Max Pooling | 200@9×9 | Valid | – | 3×3 |
| 11 | Dropout | (2×i/10)@64×64 | – | – | – |
| 12 | Batch Normalization | – | – | – | – |
| 13 | Noise | Gaussian, rate (2×i−1)/10 | – | – | – |
| 14 | Fully Connected | 64 | Same | Uniform | – |
| 15 | Convolution | 200@128×128 | Same | U[−√(6/Input), √(6/Input)] | 7×7 |
| 16 | Activation | ReLu | – | – | – |
| 17 | Max Pooling | 200@3×3 | Valid | – | 3×3 |
| 18 | Dropout | (2×i/10)@128×128 | – | – | – |
| 19 | Batch Normalization | – | – | – | – |
| 20 | Noise | Gaussian, rate (2×i−1)/10 | – | – | – |
| 21 | Fully Connected | 128 | Same | Uniform | – |
| 22 | Convolution | 200@256×256 | Same | U[−√(6/Input), √(6/Input)] | 5×5 |
| 23 | Activation | ReLu | – | – | – |
| 24 | Max Pooling | 200@5×5 | Valid | – | 3×3 |
| 25 | Dropout | (2×i/10)@256×256 | – | – | – |
| 26 | Batch Normalization | – | – | – | – |
| 27 | Noise | Gaussian, rate (2×i−1)/10 | – | – | – |
| 28 | Flatten | 256×256 | Valid | – | – |
| 29 | Fully Connected | 256 | Same | Uniform | – |
| 30 | Dropout | rate 2×i/10 | – | – | – |
| 31 | Batch Normalization | – | – | – | – |
| 32 | Soft-Max | 6, Cross Entropy | – | – | – |
| 33 | Output | – | – | – | – |


Fig. 8. Proposed 32-layer CNN architecture.

For the unconstrained underwater environment, Salman et al. (2016) introduced a convolutional neural network model to handle variations in features, classes, environment, and intra- and inter-species variation of fish; the authors achieved more than 90% classification accuracy using the LifeCLEF14 and LifeCLEF15 benchmark fish datasets. For freshwater fish species identification, a discriminant-analysis model combined with near-infrared reflectance spectroscopy was developed (Lv et al., 2017). Seven freshwater fish species of the family Cyprinidae were measured by near-infrared reflectance spectroscopy from 1000 nm to 1799 nm, and partial least squares, principal component analysis, competitive adaptive reweighted sampling and the fast Fourier transform were applied with linear discriminant analysis for the required classification. Ogunlana et al. developed a system based on the Support Vector Machine (SVM) for fish species classification on the basis of colour, shape and different sizes of fish. The body length and the five fins, namely anal, caudal, dorsal, pelvic and pectoral, were measured in centimetres; they achieved a classification accuracy of 78.59% using SVM (Ogunlana et al., 2015). Furthermore, Guney and Atasoy introduced pattern recognition methods in an electronic nose to differentiate between fish species. They applied different classification algorithms, including the binary decision tree method, Naïve Bayes and KNN, and reported that the binary decision tree method gives the best results (Güney and Atasoy, 2015). In another study (Kratzert and Mader, 2018), the authors proposed an enhanced version of the VGG-16 model for the automated classification of underwater fish species, building on the FishCam monitoring system for the observation of underwater objects.


The weights of the network pre-trained on the ImageNet dataset were kept unchanged; they classified 10 fish species and obtained 93% accuracy. To avoid the need for a huge amount of training data, Siddiqui et al. (2017) pre-trained a CNN using a cross-layer pooling algorithm based on deep learning for the classification of fish species. The proposed model was applied to images captured by underwater video in Western Australia and tested on limited labelled datasets; a fish classification accuracy of 94.3% was achieved by applying an SVM through the proposed methods. An automatic classification of fish species using five schemes was also proposed (Rodrigues et al., 2015). The classifier was designed by combining several different techniques: the features were extracted by combining PCA, SIFT, and by merging SIFT, VLAD and PCA; three input classifiers were tested by combining k-NN, SIFT class and k-means; and the data clustering was performed with the help of three algorithms, aiNet, ARIA, and k-means. The experimental results showed that the proposed model was better than other models in terms of overall performance. Rathi et al. presented fish classification results using pre-processing techniques with morphological methods and Otsu's thresholding with a CNN (Rathi et al., 2017) on their own dataset of fish species. Fish image classification has also been performed using a back-propagation classifier based on features extracted from the colour signature in the RGB colour space, the colour histogram and the grey-level co-occurrence matrix; in that study, 400 training images and 210 testing images were used, and a classification accuracy of 84% was achieved with the back-propagation classifier (Alsmadi et al., 2011). Ge et al. (2015) used two fine-grained image datasets, Fish and UEC FOOD-100; for the classification process, two baseline systems were used to represent each image with an SVM, and a single global CNN was applied to both datasets. Different filters, such as the Sobel, Prewitt, Robert, Laplacian and Canny edge detectors, have also been used for edge detection of shark fish (Shrivakshan and Chandrasekar, 2012). In the literature, researchers have mostly worked on automatic fish species classification in underwater video monitoring, or on classification of species based on the fish skin or some other particular feature. Such work is based on CNNs that do not require the calculation of any hand-engineered image features; instead, the networks use the raw image as input. In our proposed work, we have developed our own dataset and consider three basic morphological characters for the classification of fish species separately, i.e. head region, body shape, and fin rays or scale (see Table 1).
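The VGG-16 transfer-learning baseline that we compare against in Section 6 can be sketched in a few lines of Keras. This is an illustrative sketch under our own assumptions (the input size, head layers and optimizer are not prescribed by the cited works), not the authors' published code.

```python
# A minimal sketch (not the authors' code) of a VGG-16 transfer-learning
# baseline: the convolutional base keeps its ImageNet weights and only a
# new classification head is trained.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(200, 200, 3))
base.trainable = False  # keep the pre-trained network unchanged

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(6, activation="softmax"),  # six Fish-Pak target classes
])
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```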

Table 3. Detailed description of the dataset filtered from Fish-Pak and used for this study.

| Sr. | Classes | Head Region | Body Shape | Scale | Total Images |
| 1 | Ctenopharyngodon idella (Grass carp) | 8 | 6 | 5 | 19 |
| 2 | Cyprinus carpio (Common carp) | 15 | 17 | 13 | 45 |
| 3 | Cirrhinus mrigala (Mori) | 46 | 38 | 58 | 142 |
| 4 | Labeo rohita (Rohu) | 29 | 33 | 48 | 110 |
| 5 | Hypophthalmichthys molitrix (Silver carp) | 30 | 26 | 40 | 96 |
| 6 | Catla catla (Thala) | 12 | 4 | 10 | 26 |
|   | Total Images | 140 | 124 | 174 | 438 |

3. Material and methods

3.1. Deep learning structure

Nowadays, deep learning methods are broadly used to perform image classification based on the visual features of the subject (Acharya et al., 2017; Tan et al., 2017), incorporating both the feature extraction and the classification phase. These strategies can accomplish promising outcomes using complex methods trained on large-scale data. In this research, we have proposed a novel deep learning CNN architecture to perform the classification of 6 different species of fish. The proposed CNN is applied to three separate characters of the fish species, i.e. body shape, head region, and scale. In the training process, feature extraction is performed with the help of the visual information from the input images, yielding a powerful deep CNN model. The proposed CNN architecture efficiently separates unseen images of fish species during the testing stage. The detailed structure of the proposed technique is depicted in the following sections.

3.2. Convolutional Neural Network (CNN)

In the domain of deep learning, the CNN is a powerful model comprised of multiple network layers. Each layer holds unique characteristics to process the input data and send it to the next layer (Khan et al., 2019). The first layer is an image input layer, to which the required image data is given from an external source. Further layers include convolution layers of multiple dimensions, max pooling layers, average pooling layers and rectified linear unit layers (Zhou et al., 2019; Elola et al., 2019).

Fig. 9. Sample image instances with transparent background from the dataset used for the training of the proposed 32-Layer CNN architecture.


The major benefit of the deep learning CNN architecture is self-learning and self-organization, regardless of manual supervision (Salman et al., 2019). Unlike the classical machine learning process, the power of a deep learning based CNN model is that it is free from manual methods of feature extraction, i.e. hand-crafted feature extraction or extracting features using other techniques. Instead, features are extracted progressively, utilizing fully connected layers to map the raw pixels of the image information (Labao and Naval, 2019; Xu et al., 2019). The detailed explanation of the various units of the CNN is given below.

Table 4. Evaluation results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG (1B-VGG), two block VGG (2B-VGG), three block VGG (3B-VGG), LeNet-5, AlexNet, GoogleNet, and ResNet-50 on body shape images of Fish-Pak for 100, 150 and 200 epochs. Columns, in order: VGG-16 | 1B-VGG | 2B-VGG | 3B-VGG | LeNet-5 | AlexNet | GoogleNet | ResNet-50 | Proposed 32-Layer CNN.

lr = 0.01, momentum = 0.8, epochs = 100:
Accuracy: 77.10 | 78.77 | 73.09 | 75.32 | 75.43 | 85.21 | 81.44 | 88.22 | 94.01
Precision: 78.43 | 77.36 | 74.05 | 76.99 | 74.11 | 86.07 | 80.69 | 89.93 | 93.88
Recall: 78.09 | 76.75 | 74.08 | 76.14 | 75.04 | 86.34 | 82.01 | 89.89 | 95.39
F1-Score: 79.24 | 76.22 | 74.33 | 76.01 | 75.31 | 86.12 | 82.10 | 89.06 | 95.14

lr = 0.01, momentum = 0.8, epochs = 150:
Accuracy: 77.79 | 79.46 | 73.78 | 76.01 | 76.12 | 85.91 | 82.13 | 88.91 | 94.17
Precision: 79.62 | 78.55 | 75.24 | 78.18 | 75.30 | 87.26 | 81.88 | 91.12 | 95.07
Recall: 79.28 | 77.94 | 75.27 | 77.33 | 76.23 | 87.53 | 83.24 | 91.08 | 96.58
F1-Score: 80.43 | 77.41 | 75.52 | 77.12 | 76.50 | 87.31 | 83.29 | 90.25 | 96.33

lr = 0.01, momentum = 0.8, epochs = 200:
Accuracy: 79.02 | 80.69 | 75.01 | 77.24 | 77.35 | 87.13 | 83.36 | 90.14 | 95.44
Precision: 80.63 | 79.56 | 76.25 | 79.19 | 76.31 | 88.27 | 82.89 | 92.13 | 96.11
Recall: 80.34 | 79.00 | 76.33 | 78.39 | 77.29 | 88.59 | 84.26 | 92.14 | 97.64
F1-Score: 81.25 | 78.23 | 76.34 | 77.94 | 77.32 | 88.13 | 84.11 | 91.07 | 97.15

lr = 0.001, momentum = 0.9, epochs = 100:
Accuracy: 77.43 | 79.13 | 73.42 | 75.65 | 75.76 | 85.54 | 81.77 | 88.55 | 94.34
Precision: 78.76 | 77.69 | 74.38 | 77.32 | 74.44 | 86.40 | 81.02 | 90.26 | 94.21
Recall: 78.42 | 77.08 | 74.41 | 76.47 | 75.37 | 86.67 | 82.34 | 90.22 | 95.72
F1-Score: 79.57 | 76.55 | 74.66 | 76.34 | 75.64 | 86.45 | 82.43 | 89.39 | 95.47

lr = 0.001, momentum = 0.9, epochs = 150:
Accuracy: 78.67 | 80.34 | 74.66 | 76.89 | 77.00 | 86.79 | 83.04 | 89.79 | 95.05
Precision: 80.64 | 79.53 | 76.22 | 79.16 | 76.28 | 88.24 | 82.86 | 92.10 | 96.05
Recall: 80.36 | 79.02 | 76.35 | 78.41 | 77.35 | 88.61 | 84.28 | 92.16 | 97.66
F1-Score: 81.63 | 78.59 | 76.70 | 78.30 | 77.68 | 88.49 | 84.47 | 91.43 | 97.62

lr = 0.001, momentum = 0.9, epochs = 200:
Accuracy: 79.93 | 81.60 | 75.92 | 78.15 | 78.26 | 88.04 | 84.27 | 91.05 | 96.35
Precision: 81.35 | 80.28 | 76.97 | 79.91 | 77.03 | 88.99 | 83.61 | 92.85 | 96.83
Recall: 81.45 | 80.11 | 77.44 | 79.50 | 78.40 | 89.70 | 85.37 | 93.25 | 98.75
F1-Score: 82.06 | 79.04 | 77.15 | 78.75 | 78.13 | 88.94 | 84.92 | 91.88 | 97.96

3.2.1. Image input layer

The image input layer is the primary layer of the CNN architecture. This layer takes 2-D and 3-D images as input, and the size of the image must be initialized in this layer (Riesenhuber and Poggio, 1999). Let V = m × n be an input image, where m and n represent the corresponding dimensions. The equation for the image input layer can be described as:

m = Input_m(V, D)    (1)

where m is a deep learning model type, which can be sequential, parallel or concerted. The layer Input_m takes an image V with total number of dimensions D and performs some pre-processing, if required, in order to pass it to the next layer, the convolution layer.

3.2.2. Convolutional layer

In this layer, the convolution of the input image is performed (Song et al., 2019; Scherer et al., 2010). In order to obtain feature maps of the input images, a kernel window k of a specific size is run over the input image V = m × n. The stride S_d defines the shifting of the kernel window k, chosen so as to avoid fractional values. The convolution layer can be defined mathematically as:

F_m = Conv(V, k, S_d, R_l)    (2)

In Eq. (2), F_m denotes the output feature maps of the convolution layer, where m is the input from the image input layer, and R_l denotes the ReLu activation function, characterized as y = max(x, 0). The convolution process on each image V = m × n is shown below:

Conv(V × k)[m, n] = ∑_i ∑_j k[i, j] · V(m − i, n − j)    (3)

In the above equation, V and k are the corresponding image and kernel window, and m, n are the indexes of the image.

3.2.3. Batch normalization layer

This layer helps to increase the network performance by boosting the learning rate (Song et al., 2019; Scherer et al., 2010). Furthermore, it standardizes the activations of the previous layer for each batch, i.e., it applies a transformation that keeps the mean activation near 0 and the standard deviation near 1. It takes the feature maps F_m from the previous convolution layer and normalizes their activations. The layer process can be defined as:

F_{m+1} = N(F_m, a_x, M)    (4)

N(·) is a normalization function, represented in Eq. (4), where a_x denotes the axis, which is mostly 1, and M is a momentum rate that decides the variation between the mean and the variance.
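To make Eq. (3) concrete, the following NumPy sketch (an illustration, not the authors' implementation) convolves a grey-level image with a kernel using 'valid' padding and a stride of 1:

```python
import numpy as np

def conv2d_valid(V, k):
    """Discrete 2-D convolution of image V with kernel k, as in Eq. (3),
    with 'valid' padding and stride S_d = 1."""
    kh, kw = k.shape
    kf = k[::-1, ::-1]  # flip the kernel: k[i, j] multiplies V(m - i, n - j)
    out = np.zeros((V.shape[0] - kh + 1, V.shape[1] - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(kf * V[m:m + kh, n:n + kw])
    return out

V = np.random.rand(200, 200)     # an input image V = m x n
k = np.ones((3, 3)) / 9.0        # a 3 x 3 averaging kernel, for illustration
print(conv2d_valid(V, k).shape)  # (198, 198) under 'valid' padding
```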


Table 5. Evaluation results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG (1B-VGG), two block VGG (2B-VGG), three block VGG (3B-VGG), LeNet-5, AlexNet, GoogleNet, and ResNet-50 on body shape images of Fish-Pak for 250, 300 and 350 epochs. Columns, in order: VGG-16 | 1B-VGG | 2B-VGG | 3B-VGG | LeNet-5 | AlexNet | GoogleNet | ResNet-50 | Proposed 32-Layer CNN.

lr = 0.01, momentum = 0.8, epochs = 250:
Accuracy: 78.37 | 80.04 | 74.36 | 76.59 | 76.71 | 86.48 | 82.71 | 89.49 | 95.28
Precision: 78.38 | 77.31 | 74.00 | 76.94 | 74.06 | 86.02 | 80.64 | 89.88 | 94.83
Recall: 77.67 | 77.03 | 74.36 | 76.42 | 75.32 | 86.62 | 82.29 | 90.17 | 95.67
F1-Score: 80.24 | 77.22 | 75.33 | 77.01 | 76.31 | 87.12 | 83.10 | 90.06 | 96.14

lr = 0.01, momentum = 0.8, epochs = 300:
Accuracy: 79.31 | 80.97 | 75.29 | 77.52 | 77.63 | 87.41 | 83.64 | 90.42 | 95.68
Precision: 80.71 | 79.63 | 76.32 | 79.26 | 76.38 | 88.34 | 82.96 | 92.20 | 96.15
Recall: 80.68 | 79.34 | 76.67 | 78.73 | 77.63 | 88.93 | 84.60 | 92.48 | 97.98
F1-Score: 81.25 | 78.23 | 76.34 | 77.94 | 77.32 | 88.13 | 84.11 | 91.07 | 97.15

lr = 0.01, momentum = 0.8, epochs = 350:
Accuracy: 80.91 | 82.58 | 76.90 | 79.13 | 79.24 | 89.02 | 85.25 | 90.03 | 96.33
Precision: 82.52 | 81.45 | 78.14 | 81.08 | 78.20 | 90.16 | 84.78 | 90.02 | 95.01
Recall: 82.23 | 80.89 | 78.22 | 80.28 | 79.18 | 90.48 | 86.15 | 91.03 | 96.53
F1-Score: 83.14 | 80.12 | 78.23 | 79.83 | 79.21 | 90.02 | 86.00 | 91.96 | 96.04

lr = 0.001, momentum = 0.9, epochs = 250:
Accuracy: 78.82 | 80.49 | 74.81 | 77.04 | 77.16 | 86.93 | 83.16 | 89.94 | 95.73
Precision: 78.92 | 77.85 | 74.54 | 77.48 | 74.60 | 86.56 | 81.18 | 90.42 | 95.37
Recall: 78.66 | 77.58 | 74.91 | 76.97 | 75.87 | 87.17 | 82.84 | 90.72 | 96.22
F1-Score: 81.02 | 78.00 | 76.11 | 77.79 | 77.09 | 87.92 | 83.88 | 90.84 | 96.92

lr = 0.001, momentum = 0.9, epochs = 300:
Accuracy: 79.65 | 81.31 | 75.63 | 77.86 | 77.97 | 87.75 | 83.98 | 90.76 | 96.02
Precision: 81.07 | 79.99 | 76.68 | 79.62 | 76.74 | 88.72 | 83.32 | 92.56 | 96.51
Recall: 81.20 | 79.86 | 77.19 | 79.25 | 78.15 | 89.45 | 85.12 | 93.00 | 98.50
F1-Score: 81.57 | 78.55 | 76.66 | 78.26 | 77.64 | 88.45 | 84.43 | 91.39 | 97.47

lr = 0.001, momentum = 0.9, epochs = 350:
Accuracy: 81.52 | 83.19 | 77.51 | 79.74 | 79.85 | 89.63 | 85.86 | 90.64 | 96.94
Precision: 83.37 | 82.30 | 78.99 | 81.93 | 79.05 | 91.01 | 85.63 | 90.87 | 95.86
Recall: 82.89 | 81.55 | 78.88 | 80.94 | 79.84 | 91.14 | 86.81 | 91.69 | 97.19
F1-Score: 83.73 | 80.68 | 78.79 | 80.39 | 79.77 | 90.58 | 86.56 | 92.52 | 96.63

3.2.4. Max pooling layer

The max pooling operation is performed in this layer on each of the feature maps F_{m+1} obtained from the batch normalization layer, in order to reduce their size. The value of the stride is typically picked manually by the user (Song et al., 2019; Scherer et al., 2010); hence, this process typically reduces the size of the output compared to the previous size. The pool size P_s must be given to the layer as:

F_{m+1} = Pool(F_{m+1}, P_s)    (5)

3.2.5. Fully connected layer

The fully connected layer holds the network of neurons of the previous layer connected with the other layers, i.e. the batch normalization layer and the max pooling layer. The fully connected layer provides information about the total number of classes Y_n involved in the classification process as output (Serre et al., 2006), referred to in Eq. (6):

Y_n = F_connect(Conv(V, k, S_d, R_l) + N(F_m, a_x, M) + Pool(F_{m+1}, P_s))    (6)

3.2.6. Soft-max layer

This layer is an optional layer and can be applied to the input image samples depending on the nature of the data. It decreases anomalies without replacing the sample images with the updated pre-processed images. The mathematical formula for the soft-max function is given below:

F(X_i) = e^{X_i} / ∑_k e^{X_k},  for k = 0, 1, 2, …, K    (7)

F(X_i) computes the probability for each of the classes returned by the fully connected layer Y_n and marks the class with the highest probability as the target class.
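The soft-max of Eq. (7) can be sketched directly in NumPy; the subtraction of max(X) below is a standard numerical-stability device and does not change the result:

```python
import numpy as np

def softmax(X):
    """Soft-max of Eq. (7): F(X_i) = e^{X_i} / sum_k e^{X_k}, k = 0..K."""
    e = np.exp(X - np.max(X))  # stability shift; mathematically equivalent
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.2, 0.1, -1.0])  # outputs Y_n for 6 classes
probs = softmax(scores)
print(probs, probs.argmax())  # probabilities sum to 1; the argmax is the target class
```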


3.3. Pre-trained CNN models and proposed architecture

The pre-trained CNN models are explained in the following Sections 3.3.1 to 3.3.5.

3.3.1. LeNet-5

LeNet is one of the simplest and earliest CNN architectures, proposed by LeCun et al. (1998) for document recognition. It consists of a total of 7 layers, among which the 1st, 3rd and 5th layers are convolutional layers and the 2nd and 4th are pooling (sub-sampling) layers. The 6th layer is a fully connected layer, which is followed by the seventh, output layer. A few well-known design decisions were made in LeNet-5 that are not common in modern CNN architectures; for example, the kernel in layer 3 is not supposed to use all the features given by layer 2. The architecture of LeNet-5 is given in Fig. 4.
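An illustrative Keras rendering of the LeNet-5 layout just described is given below; the 200 × 200 × 3 input and the six output classes are our adaptation for Fish-Pak, not part of LeCun's original design.

```python
from tensorflow.keras import layers, models

# LeNet-5-style model: convolution at layers 1, 3 and 5, pooling at 2 and 4,
# a fully connected 6th layer and an output layer.
lenet = models.Sequential([
    layers.Conv2D(6, (5, 5), activation="tanh", input_shape=(200, 200, 3)),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation="tanh"),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(120, (5, 5), activation="tanh"),
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),
    layers.Dense(6, activation="softmax"),  # six Fish-Pak classes
])
lenet.summary()
```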

3.3.2. AlexNet

The early CNN architecture called AlexNet (Krizhevsky et al., 2012) was introduced to classify objects based on their visual contents. The ImageNet dataset (Tanaka and Tomiya, 2017) was utilized to test AlexNet for the identification of different multi-label objects. Typically, AlexNet consists of 5 convolution layers and 3 consecutive fully connected layers. In the first layer, the input image is filtered with a total of 96 kernels with an 11 × 11 filter. In the second layer, AlexNet uses 256 filters with an input of (55, 55, 256). The output obtained from the pooling and normalization of the second layer is manipulated by a total of 384 kernels (3 × 3) connected with the third convolution layer; similarly, the fourth convolution layer uses the same number and size of kernels.


Table 6. Evaluation results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG (1B-VGG), two block VGG (2B-VGG), three block VGG (3B-VGG), LeNet-5, AlexNet, GoogleNet, and ResNet-50 on head region images of Fish-Pak for 100, 150 and 200 epochs. Columns, in order: VGG-16 | 1B-VGG | 2B-VGG | 3B-VGG | LeNet-5 | AlexNet | GoogleNet | ResNet-50 | Proposed 32-Layer CNN.

lr = 0.01, momentum = 0.8, epochs = 100:
Accuracy: 71.84 | 73.51 | 67.83 | 70.06 | 70.18 | 79.95 | 76.18 | 82.96 | 88.75
Precision: 71.51 | 70.44 | 67.13 | 70.07 | 67.19 | 79.15 | 73.77 | 83.01 | 88.96
Recall: 71.03 | 70.39 | 67.72 | 69.78 | 68.68 | 79.98 | 75.65 | 83.53 | 89.03
F1-Score: 72.29 | 69.27 | 67.38 | 69.06 | 68.36 | 79.17 | 75.15 | 82.11 | 88.19

lr = 0.01, momentum = 0.8, epochs = 150:
Accuracy: 72.89 | 74.55 | 68.87 | 71.10 | 71.21 | 80.99 | 77.22 | 84.00 | 89.26
Precision: 74.39 | 73.31 | 70.00 | 72.94 | 70.06 | 82.02 | 76.64 | 85.88 | 89.83
Recall: 74.42 | 73.08 | 70.41 | 72.47 | 71.37 | 82.67 | 78.34 | 86.22 | 91.72
F1-Score: 74.63 | 71.61 | 69.72 | 71.32 | 70.70 | 81.51 | 77.49 | 84.45 | 90.53

lr = 0.01, momentum = 0.8, epochs = 200:
Accuracy: 75.25 | 76.92 | 71.24 | 73.47 | 73.58 | 83.36 | 79.59 | 84.37 | 89.89
Precision: 76.94 | 75.87 | 72.56 | 75.50 | 72.62 | 84.58 | 79.20 | 84.44 | 90.43
Recall: 76.38 | 75.01 | 72.34 | 74.40 | 73.30 | 84.60 | 80.27 | 85.15 | 90.66
F1-Score: 76.73 | 73.71 | 71.83 | 73.47 | 72.80 | 83.61 | 79.59 | 85.55 | 89.63

lr = 0.001, momentum = 0.9, epochs = 100:
Accuracy: 72.41 | 74.07 | 68.39 | 70.62 | 70.74 | 80.51 | 76.74 | 83.52 | 89.31
Precision: 72.42 | 70.93 | 67.62 | 70.56 | 67.68 | 79.64 | 74.26 | 83.50 | 89.45
Recall: 71.46 | 70.82 | 68.15 | 70.21 | 69.11 | 80.41 | 76.08 | 83.96 | 89.46
F1-Score: 72.85 | 69.78 | 67.89 | 69.57 | 68.87 | 79.68 | 75.66 | 82.62 | 88.74

lr = 0.001, momentum = 0.9, epochs = 150:
Accuracy: 73.43 | 75.09 | 69.41 | 71.64 | 71.75 | 81.53 | 77.76 | 84.54 | 89.80
Precision: 74.97 | 73.89 | 70.58 | 73.52 | 70.64 | 82.60 | 77.22 | 86.46 | 90.41
Recall: 74.95 | 73.61 | 70.94 | 73.43 | 71.91 | 83.24 | 78.87 | 86.75 | 92.25
F1-Score: 75.08 | 72.06 | 70.17 | 71.77 | 71.15 | 81.96 | 77.94 | 84.93 | 90.98

lr = 0.001, momentum = 0.9, epochs = 200:
Accuracy: 75.97 | 77.64 | 71.96 | 74.19 | 74.30 | 84.08 | 80.31 | 85.13 | 90.61
Precision: 77.57 | 76.50 | 73.19 | 76.13 | 73.25 | 85.21 | 79.83 | 85.07 | 91.06
Recall: 76.95 | 75.58 | 72.91 | 74.97 | 73.87 | 85.17 | 80.84 | 85.72 | 91.23
F1-Score: 77.44 | 74.42 | 72.54 | 74.18 | 73.51 | 84.32 | 80.30 | 86.26 | 90.34


3.3.3. VGG-16

VGG-16 (Simonyan and Zisserman, 2014) describes a particularly deep CNN model. The principal feature of this architecture is the use of several successive convolutional layers. In VGG-16 the kernel size is 3 × 3, smaller than in other CNN architectures; this small 3 × 3 kernel helps in deriving patterns using deep information. Unlike other CNN architectures, which use kernels with larger sizes of 11 × 11 or 7 × 7 and with 5 strides, VGG-16 preserves 3 × 3 kernels and only 1 stride for each of the convolutional layers. The VGG-16 architecture with five consecutive convolution layers of 3 × 3 kernels is given in Fig. 5.
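The "one block" to "three block" VGG baselines used in our comparison stack such 3 × 3 convolutions; a Keras sketch of a VGG-style block, under our own assumptions about the unspecified input size and head layers, is:

```python
from tensorflow.keras import layers, models

def vgg_block(model, filters, convs):
    """Append a VGG-style block: `convs` successive 3 x 3 convolutions
    (stride 1, 'same' padding) followed by 2 x 2 max pooling."""
    for _ in range(convs):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

# An illustrative "two block VGG" baseline (filter counts are assumptions).
model = models.Sequential([layers.Input(shape=(200, 200, 3))])
vgg_block(model, 32, convs=2)
vgg_block(model, 64, convs=2)
model.add(layers.Flatten())
model.add(layers.Dense(6, activation="softmax"))
```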

3.3.4. GoogleNet

In the 2014 ILSVRC competition, GoogleNet (Szegedy et al., 2015) utilized a 22-layer CNN architecture, reporting a 5.5% error rate on the top-5 classification task. Due to the addition of the inception module to the network structure, the GoogleNet network is more complicated than VGGNet. The building block of the inception model in the GoogleNet architecture can be seen in Fig. 6.

3.3.5. ResNet-50

The Residual Neural Network (ResNet) was proposed by Kaiming He et al. (2016). ResNet won the ILSVRC 2015 title by utilizing the residual unit to effectively train a 152-layer deep neural network with a 3.57% top-5 error rate. The core of ResNet, like Highway Networks, utilizes skip connections that let some input bypass (skip) layers, preserving the data flow and maintaining a necessary distance from the loss of data passing through gradient descent into the layers. The ResNet structure is better than average models in that it can greatly enhance the accuracy of the model and accelerate the training of neural networks. The ResNet idea originates from the degradation problem that emerges as the depth of a CNN increases. The ResNet architecture uses congruent mapping to transfer the output of the previous layer to the next layer, regardless of the bias and weights. Let a_i be the input to the first layer of ResNet; the output for each instance will be H(a), and the required learning goal can be defined as F(a) = H(a_i) − a. Fig. 7 shows the building block of residual learning.

4. Proposed CNN architecture

Classical CNNs are composed of two major parts. The first part includes the convolutional process and the max pooling operation, while in the second part a fully connected layer takes the input from the output of the previous layer and performs the classification task. CNNs use an automatic, hierarchical feature extractor to map the values of the visual features into the vector space. The proposed 32-layer CNN architecture {C1, C2, C3, C4, C5, …, C32} is based on the VGGNet model. The proposed architecture is divided into five chunks of layer hierarchy, {CH1, CH2, CH3, CH4, CH5}. The first three chunks CH1, CH2, CH3 contain 21 layers {C1, C2, C3, C4, C5, …, C21}, 7 in each chunk. The fourth chunk CH4 includes 8 layers {C22, C23, C24, C25, …, C29}, and the final chunk CH5 has 3 layers {C30, C31, C32}. For the first three chunks, the layer hierarchy remains the same and is repeated with different feature parameters.


Table 7. Evaluation results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG (1B-VGG), two block VGG (2B-VGG), three block VGG (3B-VGG), LeNet-5, AlexNet, GoogleNet, and ResNet-50 on head region images of Fish-Pak for 250, 300 and 350 epochs. Columns, in order: VGG-16 | 1B-VGG | 2B-VGG | 3B-VGG | LeNet-5 | AlexNet | GoogleNet | ResNet-50 | Proposed 32-Layer CNN.

lr = 0.01, momentum = 0.8, epochs = 250:
Accuracy: 75.36 | 77.03 | 71.35 | 73.58 | 73.69 | 83.47 | 79.70 | 84.48 | 90.02
Precision: 77.26 | 76.19 | 72.88 | 75.82 | 72.94 | 84.90 | 79.52 | 84.76 | 90.75
Recall: 76.61 | 75.24 | 72.57 | 74.63 | 73.53 | 84.83 | 80.54 | 85.38 | 90.89
F1-Score: 76.84 | 73.82 | 71.94 | 73.58 | 72.91 | 83.72 | 79.73 | 85.66 | 89.74

lr = 0.01, momentum = 0.8, epochs = 300:
Accuracy: 76.01 | 77.68 | 72.00 | 74.23 | 74.34 | 84.12 | 80.35 | 85.13 | 90.67
Precision: 77.74 | 76.67 | 73.36 | 76.30 | 73.42 | 85.38 | 80.00 | 85.24 | 91.23
Recall: 77.63 | 76.26 | 73.59 | 75.65 | 74.55 | 85.85 | 81.56 | 86.40 | 91.91
F1-Score: 77.61 | 74.59 | 72.71 | 74.35 | 73.68 | 84.49 | 80.50 | 86.43 | 90.51

lr = 0.01, momentum = 0.8, epochs = 350:
Accuracy: 76.89 | 78.56 | 72.88 | 75.11 | 75.22 | 85.34 | 81.23 | 86.01 | 91.55
Precision: 77.88 | 76.81 | 73.50 | 76.44 | 73.56 | 85.52 | 80.14 | 85.38 | 91.37
Recall: 77.79 | 76.33 | 73.66 | 75.72 | 74.62 | 85.92 | 81.63 | 86.47 | 91.91
F1-Score: 77.73 | 74.71 | 72.83 | 74.47 | 73.80 | 84.61 | 80.62 | 86.55 | 90.63

lr = 0.001, momentum = 0.9, epochs = 250:
Accuracy: 75.71 | 77.38 | 71.70 | 73.93 | 74.04 | 83.82 | 80.05 | 84.83 | 90.37
Precision: 77.65 | 76.58 | 73.27 | 76.21 | 73.33 | 85.29 | 79.91 | 85.15 | 91.14
Recall: 76.93 | 75.56 | 72.89 | 74.95 | 73.85 | 85.15 | 80.86 | 85.70 | 91.21
F1-Score: 77.12 | 74.10 | 72.22 | 73.86 | 73.19 | 84.00 | 80.01 | 85.94 | 90.02

lr = 0.001, momentum = 0.9, epochs = 300:
Accuracy: 76.56 | 78.23 | 72.55 | 74.78 | 74.89 | 84.67 | 80.90 | 85.68 | 91.22
Precision: 78.32 | 77.25 | 73.94 | 76.88 | 74.00 | 85.96 | 80.58 | 85.82 | 91.81
Recall: 77.99 | 76.62 | 73.95 | 76.01 | 74.91 | 86.21 | 81.92 | 86.76 | 92.27
F1-Score: 78.04 | 75.02 | 73.14 | 74.78 | 74.11 | 84.92 | 80.93 | 86.86 | 90.94

lr = 0.001, momentum = 0.9, epochs = 350:
Accuracy: 77.34 | 79.01 | 73.33 | 75.56 | 75.67 | 85.79 | 81.68 | 86.46 | 92.00
Precision: 78.25 | 77.18 | 73.87 | 76.81 | 73.93 | 85.89 | 80.51 | 85.75 | 91.74
Recall: 78.05 | 76.59 | 73.92 | 75.98 | 74.88 | 86.18 | 81.89 | 86.73 | 92.17
F1-Score: 78.03 | 75.01 | 73.13 | 74.77 | 74.14 | 84.91 | 80.92 | 86.85 | 90.93

The first layer of the proposed CNN architecture, C1, is a convolution layer that performs the convolution part and is convolved with the layer input. The second layer C2 is a ReLu layer, used to maintain the important image features by removing the redundancy of the image data, and the third layer C3 is a max pooling layer. We added a dropout layer (Srivastava et al., 2014) as the fourth layer C4, in which some instances are dropped tentatively from the network to maintain the approximation of the nodes effectively and to avoid overfitting. The dropout rate used in the proposed architecture is defined as 2 × i / 10 for each chunk CH_n, where i = {1, 2, 3, 4, 5} and n = {1, 2, 3, 4, 5}. Batch normalization is performed in the fifth layer C5. After dropping out some units, the feature output is given to the sixth layer C6 to remove some noise from the sample data and perform augmentation; we employed Gaussian noise, and the rate for each chunk CH_n is determined as (2 × i − 1) / 10, where i = {1, 2, 3, 4, 5} and n = {1, 2, 3, 4, 5}. The seventh layer C7 of CH1 is a fully connected layer. We include 4 fully connected layers with non-linearity in the proposed CNN architecture to ensure good interaction among the nodes of the network and to account for every reasonable feature dependency. Once chunk CH1 is executed, the predicted target classes are given to the eighth layer, which lies in chunk CH2, and the model is re-executed with the same hierarchy of layers. Layers {C8, C9, …, C14} are the same layers as {C1, C2, …, C7} but with different kernel sizes, neurons and feature maps; similarly, layers {C15, C16, …, C21} follow the same structure as {C8, C9, …, C14}. In the fourth chunk CH4, we added the flatten layer C27 in order to reshape the pooled feature maps into one unique column. The last three layers, in chunk CH5, are {C30, C31, C32}, representing the dropout, batch normalization and soft-max layers, which finally map the feature intensities onto the target classes. A brief description of the proposed CNN architecture is given in Table 2, and a sketch of one such chunk is given below.
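The following Keras sketch illustrates the chunk hierarchy just described, assuming that the Gaussian noise "rate" maps onto Keras' standard-deviation argument; it is an illustration under those assumptions, not our exact training code.

```python
from tensorflow.keras import layers, models

def chunk(model, filters, conv_kernel, i):
    """One chunk CH_i of the proposed architecture, following Table 2:
    Convolution -> ReLU -> MaxPooling -> Dropout -> BatchNormalization ->
    GaussianNoise -> FullyConnected. Dropout rate is 2*i/10 and the noise
    rate (2*i - 1)/10 (mapped here onto the stddev argument)."""
    model.add(layers.Conv2D(filters, conv_kernel, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((3, 3)))
    model.add(layers.Dropout(2 * i / 10))
    model.add(layers.BatchNormalization())
    model.add(layers.GaussianNoise((2 * i - 1) / 10))
    model.add(layers.Dense(32 * 2 ** (i - 1), activation="relu"))  # 32, 64, 128, 256

model = models.Sequential([layers.Input(shape=(200, 200, 3))])
for i, (f, k) in enumerate([(32, (3, 3)), (64, (9, 9)), (128, (7, 7)), (256, (5, 5))],
                           start=1):
    chunk(model, f, k, i)
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))            # rate for the final chunk is assumed
model.add(layers.BatchNormalization())
model.add(layers.Dense(6, activation="softmax"))  # soft-max over the six classes
```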


In Table 2, 'Valid' padding represents no padding, and 'Same' padding shows that the image feature dimensions of the output will be the same as those of the input. The filter sizes used by the four convolution layers are 32, 64, 128 and 256, respectively. Two different kernel initializers, a uniform random initializer and a normal truncated initializer, are selected for the kernel initialization. The sequence followed by the normal truncated initializer can be defined by the function U[−√(6/Input), √(6/Input)], where −√(6/Input) and √(6/Input) are the lower and upper limits and Input is the number of instances for each weight vector in the corresponding layer. We introduced a pool-based convolution which repeatedly convolves the image to extract features in more depth. The benefit of the proposed 32-layer architecture is the determination of the loss at each pool or level, so that the model trains better and extracts deeper features. A graphical representation of the proposed 32-layer CNN architecture is presented in Fig. 8.

5. Experiments

5.1. Dataset acquisition

We have acquired the entire dataset, entitled Fish-Pak, from tropical areas of Pakistan. Initially, the labelling of the Fish-Pak dataset was performed by a domain expert. The dataset originally consisted of 915 images of three distinct views (head region, body shape, and scale) with dimensions of 5184 × 3456 pixels. We have scanned Fish-Pak and filtered 438 images with three accurate views of the head, body and scale areas. A detailed description of the dataset used is given in Table 3.


Table 8. Evaluation results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG (1B-VGG), two block VGG (2B-VGG), three block VGG (3B-VGG), LeNet-5, AlexNet, GoogleNet, and ResNet-50 on scale region images of Fish-Pak for 100, 150 and 200 epochs. Columns, in order: VGG-16 | 1B-VGG | 2B-VGG | 3B-VGG | LeNet-5 | AlexNet | GoogleNet | ResNet-50 | Proposed 32-Layer CNN.

lr = 0.01, momentum = 0.8, epochs = 100:
Accuracy: 67.05 | 68.72 | 63.04 | 65.27 | 65.39 | 75.16 | 71.39 | 78.17 | 83.96
Precision: 66.98 | 65.91 | 62.60 | 65.54 | 62.66 | 74.62 | 69.24 | 78.48 | 84.43
Recall: 66.38 | 65.74 | 63.07 | 65.13 | 64.03 | 75.33 | 71.87 | 78.88 | 84.38
F1-Score: 67.42 | 64.38 | 62.49 | 64.17 | 63.47 | 74.28 | 70.26 | 77.22 | 83.35

lr = 0.01, momentum = 0.8, epochs = 150:
Accuracy: 68.01 | 69.65 | 63.97 | 66.24 | 66.31 | 76.12 | 72.32 | 79.10 | 84.36
Precision: 69.41 | 68.33 | 65.02 | 67.96 | 65.08 | 77.04 | 71.66 | 80.94 | 84.85
Recall: 69.53 | 68.21 | 65.52 | 67.58 | 66.48 | 77.78 | 73.45 | 81.33 | 85.83
F1-Score: 70.06 | 67.04 | 65.15 | 66.75 | 66.13 | 76.94 | 72.92 | 80.88 | 84.96

lr = 0.01, momentum = 0.8, epochs = 200:
Accuracy: 68.14 | 70.81 | 64.13 | 67.36 | 66.47 | 76.25 | 72.48 | 79.96 | 84.78
Precision: 68.83 | 70.76 | 64.45 | 67.41 | 68.51 | 77.47 | 72.61 | 80.33 | 84.32
Recall: 69.27 | 71.90 | 64.23 | 67.36 | 68.12 | 77.14 | 72.16 | 80.04 | 84.55
F1-Score: 69.62 | 71.64 | 64.72 | 67.36 | 69.71 | 78.50 | 72.48 | 80.44 | 84.52

lr = 0.001, momentum = 0.9, epochs = 100:
Accuracy: 67.81 | 69.47 | 63.79 | 66.02 | 66.14 | 75.91 | 72.14 | 78.92 | 84.71
Precision: 67.55 | 66.48 | 63.17 | 66.11 | 63.23 | 75.19 | 69.81 | 79.05 | 85.00
Recall: 67.04 | 66.40 | 63.73 | 65.79 | 64.69 | 75.99 | 72.53 | 79.54 | 85.04
F1-Score: 67.96 | 64.92 | 63.03 | 64.71 | 64.01 | 74.82 | 70.82 | 77.76 | 83.89

lr = 0.001, momentum = 0.9, epochs = 150:
Accuracy: 68.55 | 70.14 | 64.46 | 66.73 | 66.80 | 76.61 | 72.81 | 79.59 | 84.85
Precision: 69.89 | 68.81 | 65.56 | 68.44 | 65.56 | 77.52 | 72.14 | 81.42 | 85.33
Recall: 70.05 | 68.73 | 66.04 | 68.10 | 67.00 | 78.35 | 73.97 | 81.85 | 86.35
F1-Score: 70.48 | 67.46 | 65.57 | 67.17 | 66.55 | 77.36 | 73.34 | 81.32 | 85.38

lr = 0.001, momentum = 0.9, epochs = 200:
Accuracy: 68.65 | 71.32 | 64.64 | 67.87 | 66.98 | 76.76 | 72.99 | 80.47 | 85.29
Precision: 69.26 | 71.19 | 64.88 | 67.84 | 68.94 | 77.97 | 73.04 | 80.76 | 84.75
Recall: 69.72 | 72.35 | 64.68 | 67.81 | 68.57 | 77.59 | 72.61 | 80.49 | 85.00
F1-Score: 70.25 | 72.22 | 65.36 | 67.94 | 70.29 | 79.08 | 73.06 | 81.02 | 85.19


5.2. Camera specification & pre-processing

To capture images, a camera (Canon EOS 1300D) with a CMOS sensor supporting a resolution of 5202 × 3465 pixels and a sensor size of 14.9 × 22.3 mm was used. The camera mode setting was Scene, with the sub-classification Snow scene, as it exhibited the best mode for the artificial light conditions of the case, with a 14-megapixel picture size (5184 × 3456 pixels) at a 3:2 ratio, and with flash and face detection deactivated when capturing the fish body and scale. Each image used for the experiments was converted to a size of 200 × 200. In order to obtain high foreground object contrast and to avoid unwanted multiple background light intensities, the dataset was pre-processed to make each image background transparent, as shown in Fig. 9; a sketch of such a pre-processing step is given below.
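The paper does not name the tools used for this step; the following Pillow sketch is only illustrative, and the simple white-pixel test is a placeholder for whatever background segmentation was actually applied.

```python
from PIL import Image

def preprocess(path, out_path, size=(200, 200), thresh=240):
    """Resize a capture to 200 x 200 and add an alpha channel so that
    near-white background pixels become transparent (cf. Fig. 9)."""
    img = Image.open(path).convert("RGBA").resize(size)
    pixels = [(r, g, b, 0) if r > thresh and g > thresh and b > thresh
              else (r, g, b, a) for (r, g, b, a) in img.getdata()]
    img.putdata(pixels)
    img.save(out_path)  # PNG keeps the transparency

preprocess("fish_body_001.jpg", "fish_body_001.png")  # hypothetical file names
```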

5.3. Training and testing

Our experiments are performed on a total of 438 images from the head region (140), body shape (124), and scale (174) views separately. The training is carried out on 348 images, and 90 images are chosen for testing the proposed CNN classifier. The k-fold cross-validation method is employed to train and validate the image samples. The value of k is set to 10, which means a total of 10 chunks are used for partitioning the training and testing data differently. In the three experiments, head region (140), body shape (124), and scale (174), the fold retained for the testing purpose contains 90 images per chunk; the number of images in the testing fold for each chunk is: head region (28), body shape (16), and scale (35). At each value of k, the 9 folds reserved for training contain head region (112), body shape (108), and scale (139) images.
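A sketch of this 10-fold protocol with scikit-learn is given below; `images` and `labels` are placeholders for the loaded Fish-Pak arrays, not actual data loading code.

```python
import numpy as np
from sklearn.model_selection import KFold

# Each fold in turn is retained for testing while the remaining nine
# folds are used for training.
images = np.zeros((140, 200, 200, 3))   # e.g. the 140 head-region images
labels = np.zeros(140, dtype=int)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(images)):
    x_train, y_train = images[train_idx], labels[train_idx]
    x_test, y_test = images[test_idx], labels[test_idx]
    # model.fit(x_train, y_train); model.evaluate(x_test, y_test)
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```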

The options set for the training of the proposed 32-Layer architecture are as follows.

5.3.1. Momentum

Momentum considers past gradients to smooth out the updates of gradient descent. It can be combined with stochastic gradient descent, mini-batch gradient descent and batch gradient descent. We used the gradient descent solver with momentum to train on our dataset using the proposed and state-of-the-art CNN architectures. We examined momentum values from 0.5 to 0.99 to accelerate the optimization process and considered the number of epochs while selecting the best momentum value. On a trial-and-error basis, momentum values of 0.8 and 0.9 gave better results; however, a momentum of 0.9 takes larger steps compared to 0.8.

5.3.2. Iterations

An epoch is an arbitrary cut-off, commonly defined as one pass over the whole dataset, used to separate training into distinct stages, which is essential for logging and intermediate evaluation. In our experiments, to ensure the integrity of our proposed model, we used 100, 150, 200, 250, 300 and 350 epochs for the training and testing of the dataset.


5.3.3. Learning rate

For each CNN model, the learning rate lr has a prominent impact on the overall accuracy. A learning rate that is too low may cause a slow execution rate of the entire model, so that the time complexity tends to increase, while with a learning rate that is too large there is a fair chance that the model gets stuck at suboptimal results. The purpose of testing the proposed model on different learning rates is to achieve a faster convergence rate at the higher number of 350 epochs.


Table 9. Evaluation results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG (1B-VGG), two block VGG (2B-VGG), three block VGG (3B-VGG), LeNet-5, AlexNet, GoogleNet, and ResNet-50 on scale region images of Fish-Pak for 250, 300 and 350 epochs. Columns, in order: VGG-16 | 1B-VGG | 2B-VGG | 3B-VGG | LeNet-5 | AlexNet | GoogleNet | ResNet-50 | Proposed 32-Layer CNN.

lr = 0.01, momentum = 0.8, epochs = 250:
Accuracy: 69.09 | 71.71 | 65.03 | 68.26 | 67.37 | 77.15 | 73.38 | 80.86 | 85.68
Precision: 69.78 | 71.89 | 65.58 | 68.54 | 69.64 | 78.60 | 73.74 | 81.46 | 85.45
Recall: 70.22 | 73.15 | 65.36 | 68.49 | 69.25 | 78.27 | 73.29 | 81.17 | 85.68
F1-Score: 70.57 | 72.68 | 65.76 | 68.40 | 70.75 | 79.54 | 73.52 | 81.48 | 85.56

lr = 0.01, momentum = 0.8, epochs = 300:
Accuracy: 69.53 | 72.15 | 65.47 | 68.79 | 67.81 | 77.59 | 73.82 | 81.35 | 86.12
Precision: 70.22 | 72.33 | 66.02 | 68.98 | 70.08 | 79.04 | 74.18 | 81.98 | 85.89
Recall: 70.66 | 73.59 | 65.80 | 68.93 | 69.69 | 78.71 | 73.73 | 81.61 | 86.12
F1-Score: 71.01 | 73.12 | 66.22 | 68.84 | 71.19 | 79.98 | 73.96 | 81.92 | 86.44

lr = 0.01, momentum = 0.8, epochs = 350:
Accuracy: 69.94 | 72.52 | 65.67 | 69.21 | 68.52 | 78.30 | 74.41 | 81.94 | 86.71
Precision: 70.59 | 72.70 | 66.22 | 69.41 | 70.79 | 79.75 | 74.77 | 82.57 | 86.48
Recall: 71.03 | 73.96 | 66.00 | 69.35 | 70.34 | 79.42 | 74.32 | 82.22 | 86.71
F1-Score: 71.38 | 73.49 | 66.42 | 69.26 | 71.19 | 80.69 | 74.55 | 82.51 | 87.03

lr = 0.001, momentum = 0.9, epochs = 250:
Accuracy: 69.09 | 71.71 | 65.03 | 68.26 | 67.37 | 77.15 | 73.38 | 80.86 | 85.68
Precision: 69.78 | 71.89 | 65.58 | 68.54 | 69.64 | 78.60 | 73.74 | 81.46 | 85.45
Recall: 70.22 | 73.15 | 65.36 | 68.49 | 69.25 | 78.27 | 73.29 | 81.17 | 85.68
F1-Score: 70.57 | 72.68 | 65.76 | 68.40 | 70.75 | 79.54 | 73.52 | 81.48 | 85.56

lr = 0.001, momentum = 0.9, epochs = 300:
Accuracy: 69.53 | 72.15 | 65.47 | 68.79 | 67.81 | 77.59 | 73.82 | 81.35 | 86.12
Precision: 70.22 | 72.33 | 66.02 | 68.98 | 70.08 | 79.04 | 74.18 | 81.98 | 85.89
Recall: 70.66 | 73.59 | 65.82 | 68.93 | 69.69 | 78.71 | 73.73 | 81.61 | 86.12
F1-Score: 71.01 | 73.12 | 66.22 | 68.84 | 71.19 | 79.98 | 73.96 | 81.92 | 86.44

lr = 0.001, momentum = 0.9, epochs = 350:
Accuracy: 70.54 | 73.12 | 66.27 | 69.81 | 69.12 | 78.90 | 75.01 | 82.54 | 87.33
Precision: 71.33 | 73.41 | 66.93 | 70.12 | 71.54 | 80.46 | 75.48 | 83.28 | 87.19
Recall: 71.66 | 74.59 | 66.63 | 69.98 | 70.97 | 80.05 | 74.95 | 82.85 | 87.37
F1-Score: 72.13 | 74.24 | 67.17 | 70.01 | 71.94 | 81.44 | 75.39 | 84.03 | 87.78

We tried 5 different lr values, 0.1, 0.01, 0.001, 0.0001 and 0.00001, and found the best results for the optimal ones, 0.01 and 0.001, in combination with different numbers of epochs and momentum values. Both learning rates influence the accuracy results in approximately the same ratio.
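For illustration, the two training configurations used throughout the experiments can be expressed with the Keras SGD optimizer; the batch size below is an assumption, as it is not reported in the text.

```python
from tensorflow.keras.optimizers import SGD

# Gradient descent with momentum at the two learning-rate/momentum pairs,
# run for 100-350 epochs (Section 5.3).
configs = [
    {"lr": 0.01,  "momentum": 0.8},
    {"lr": 0.001, "momentum": 0.9},
]
for cfg in configs:
    opt = SGD(learning_rate=cfg["lr"], momentum=cfg["momentum"])
    # model.compile(optimizer=opt, loss="categorical_crossentropy",
    #               metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=350, batch_size=32)  # batch size assumed
```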

5.4. Accuracy measures

To validate the performance of the proposed 32-Layer CNN architecture against the other state-of-the-art CNN classifiers, namely VGG-16 for transfer learning, one block VGG, two block VGG, three block VGG, LeNet-5, AlexNet, GoogleNet, and ResNet-50, we used the accuracy measures Accuracy, Precision, Recall, and F1-Score. Precision refers to the ratio of correctly predicted class instances (TP) to the total number of instances predicted for the specific class (TP + FP), whereas Recall is the ratio of correctly predicted class instances (TP) to the total number of instances associated with the class (TP + FN); in our case there are 6 classes. When Recall and Precision are averaged according to their weight, the result is the F1-Score. Accuracy relates the sum of correctly predicted positive and negative instances (TP + TN) to the total instances (TP + FP + TN + FN). The equation for each accuracy measure is given in Eqs. (8)–(11).

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)

F1-Score = 2 × (Recall × Precision) / (Recall + Precision)    (10)

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (11)
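For illustration, Eqs. (8)–(11) can be computed from the per-class confusion counts as follows; the counts shown are examples only, not taken from our experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy measures of Eqs. (8)-(11) from the confusion counts of one class."""
    precision = tp / (tp + fp)                            # Eq. (8)
    recall = tp / (tp + fn)                               # Eq. (9)
    f1 = 2 * (recall * precision) / (recall + precision)  # Eq. (10)
    accuracy = (tp + tn) / (tp + fp + tn + fn)            # Eq. (11)
    return precision, recall, f1, accuracy

print(classification_metrics(tp=42, fp=3, tn=40, fn=5))  # illustrative counts
```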

6. Results

To ensure the effectiveness of our proposed 32-Layer CNN architecture, we have carried out the experimental comparison in three phases. In the first phase, the proposed model is compared with the other traditional and state-of-the-art CNN architectures on the dataset with the fish body shape view. This experiment was done with 100, 150, 200, 250, 300 and 350 epochs, as shown in Tables 4 and 5. The learning rate was set to 0.01 and 0.001, and the momentum rate was selected as 0.8 and 0.9. The maximum accuracy of 96.63% was achieved for 350 epochs, lr = 0.001 and momentum = 0.9 on the body view dataset with the proposed 32-Layer CNN architecture. The second phase consists of the comparative evaluation of the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG, two block VGG, three block VGG, LeNet-5, AlexNet, GoogleNet, and ResNet-50 on head region images of Fish-Pak for 100, 150, 200, 250, 300 and 350 epochs; the same learning rates and momentum values were chosen as in the first experimental phase. Tables 6 and 7 present the experimental results for the head region dataset, where a maximum accuracy of 92.00% was achieved by the proposed architecture. In the last phase of the experimental evaluation, we utilized the dataset of images with the scale view of the fish species. A maximum accuracy of 87.33% was obtained by the proposed 32-Layer CNN architecture over 100, 150, 200, 250, 300 and 350 epochs with lr = 0.001 and momentum = 0.9. Results of the third phase of the experiment are given in Tables 8 and 9.



Fig. 10. Deep representation of first six filters of each convolution layer in the proposed 32-Layer architecture, (a) 1st Conv 3 × 3, (b) 2nd Conv 9 × 9, (c) 3rd Conv 7 × 7 and (d) 4th Conv 5 × 5.


6.1. Visualization of filters and feature maps

Visualization of the filters and the corresponding feature maps can be a useful technique to follow the intermediate processing of the CNN architecture. In a neural-network-oriented process, the filters refer to the weights of the image for each convolutional layer. Due to the specific two-dimensional shape of the learned filters, weight units are spatially linked to each other.



Fig. 11. Sample input feature representation of 2nd Conv layer of proposed model with 64 maps.

Fig. 12. Sample input feature representation of 3rd Conv layer of proposed model with first 64 maps.


In Fig. 10, we have presented the deep representation of the four different filters used for the four convolutional layers in the proposed 32-Layer architecture. As can be seen from Fig. 10, the squares with a dark shade represent the small weights of the network, while the squares with light shades show the large weights. The spatial relation in the last row identifies a gradient from light in the top left to dark in the bottom right. For a deeper look into the processing, feature maps are best suited to examine the impact of applying a filter to the input image. The feature maps, or activation maps, identify the features extracted from the image at each level of the convolutional process. In Figs. 11 and 12, we have shown the 64 high-level feature representations from the 2nd and 3rd convolution layers of the proposed 32-Layer CNN architecture. As can be seen in Fig. 11, the features are close to the input images and easily interpretable, while in Fig. 12 the maps are closer to the output of the model, becoming low level and not human-interpretable.

6.2. Discussions

Despite the effective performance achieved by the proposed 32-Layer CNN architecture, it is not without limitations, as the size of the dataset is small while the testing epochs reach 350. Furthermore, the gradient descent momentum and the learning rate also affect the accuracy performance. As can be seen from Table 4, the proposed CNN architecture obtained 94.01% and 95.44% accuracy on body view images with a momentum rate of 0.8, lr = 0.01 and iterations of 100 and 200, respectively. The accuracy increased up to 94.34% and 96.35% for a momentum rate of 0.9 and lr = 0.001. Similarly, from Table 5, our model obtained accuracies of 95.28%, 95.68% and 96.33% when examined on 250, 300 and 350 epochs with lr = 0.01 and momentum = 0.8; at the higher iterations with lr = 0.001 and momentum = 0.9 the accuracy becomes 95.73%, 96.02% and 96.94%. Referring to Table 5, after the proposed model, ResNet-50 performed better than the others, as it achieved 89.94%, 90.76% and 90.64% for 250, 300 and 350 epochs with lr = 0.001 and momentum = 0.9. The decreasing order of performance is: Proposed 32-Layer CNN > ResNet-50 > AlexNet > GoogleNet > One Block VGG > VGG-16 for transfer learning > Three Block VGG > LeNet-5 > Two Block VGG. Tables 6 and 7 demonstrate the excellent results achieved by the proposed model on the head region dataset of the fish species. Our model obtained F1-scores of 88.19%, 89.26% and 89.63% for 100, 150 and 200 epochs with lr = 0.01 and momentum = 0.8, sequentially. The overall performance on the head view dataset is lower than the performance of the proposed model on the body view dataset: as the number of features in the body region of a fish is higher than in the head region, this difference in features produces a high impact on the classification accuracy. As with the experimentation on the body view dataset, the proposed model retained the same accuracy behaviour on the head view dataset, from lower epochs to higher epochs, from lr = 0.01 to 0.001 and from momentum values of 0.8 to 0.9.

Table 10. Number of parameters trained for each model.

| Model | Number of Parameters |
| VGG-16 | 6.2M |
| One Block VGG | 40.9M |
| Two Block VGG | 20.5M |
| Three Block VGG | 10.3M |
| LeNet-5 | 4.2M |
| AlexNet | 247.4M |
| GoogleNet | 165.9M |
| ResNet-50 | 264.7M |
| Proposed 32-Layer CNN | 404.4M |

The F1-scores accomplished by comparing the proposed 32-Layer CNN architecture (89.74%) with VGG-16 for transfer learning (76.84%), one block VGG (73.82%), two block VGG (71.94%), three block VGG (73.58%), LeNet-5 (72.91%), AlexNet (83.72%), GoogleNet (79.73%), and ResNet-50 (85.66%) on head region images of Fish-Pak for 250, 300 and 350 epochs with lr = 0.01 and momentum = 0.8 are demonstrated in Table 7, which shows that our model performed well in terms of a better correct classification ratio and overall performance. The accuracy obtained on the scale region or fin-ray image dataset is lower than the accuracy on the head view and body view images. From Tables 8 and 9, the precision and recall measures gained for 350 epochs, lr = 0.01 and momentum = 0.8 are: proposed 32-Layer CNN architecture (85.89%, 86.12%), VGG-16 for transfer learning (70.66%), one block VGG (72.33%, 73.59%), two block VGG (66.02%, 65.82%), three block VGG (68.98%, 68.93%), LeNet-5 (70.08%, 69.69%), AlexNet (79.04%, 78.71%), GoogleNet (74.18%, 73.73%), and ResNet-50 (81.98%, 81.61%). As can be seen from Table 9, our model achieves the highest accuracy of 87.33% at the higher number of 350 epochs over the traditional and state-of-the-art CNN architectures. Referring to Tables 5–9, the periodic changes in the accuracy of our proposed model with learning rates lr = (0.01, 0.001) are (96.33%, 96.94%) on the body view image dataset, (91.55%, 92.00%) on the head view image dataset and (86.71%, 87.33%) on the scale view dataset over 350 epochs. The ROC curves of FPR (false positive rate) versus TPR (true positive rate) for all classes of the body view dataset are given in Fig. 13. We also observed that the parameter count of the proposed 32-Layer architecture is far greater than that of the other traditional CNN architectures; hence, the proposed model needs more time to optimize all parameters, and we therefore decrease the learning rate to lr = 0.001 to boost the convergence rate. Table 10 lists the total number of parameters that need to be trained for each of the traditional architectures and the proposed CNN architecture. The ROC curves in Fig. 13 support the empirical results achieved by comparing the proposed 32-Layer CNN architecture with VGG-16 for transfer learning, one block VGG, two block VGG, three block VGG, LeNet-5, AlexNet, GoogleNet, and ResNet-50 on body view images against the six target classes. As can be seen in Fig. 13(a), the proposed model achieved a higher AUC of 0.969, which is near 1, as compared to the other models for the class Ctenopharyngodon idella (Grass carp); the higher AUC of the proposed model exhibits its excellent measure of separability. On the other hand, for the same class, the two block VGG model has an AUC of 0.752, which is far from 1, reflecting the worst measure of separability. In terms of the Cirrhinus mrigala (Mori) class, Fig. 13(c), the ROC curve of the proposed 32-layer model achieved greater than 95% sensitivity and specificity, as opposed to VGG-16 and all its variants. The ROC curves of the remaining target classes also confirm that the proposed 32-layer model outperformed the other models, achieving better TPRs of 0.952 (b), 0.971 (c), 0.966 (d), 0.956 (e) and 0.972 (f). After the proposed 32-layer model, ResNet-50 appeared as the second-best model, obtaining AUCs of 0.903 (a), 0.892 (b), 0.903 (c), 0.897 (d), 0.905 (e) and 0.896 (f), referring to Fig. 13.
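The per-class ROC analysis of Fig. 13 can be reproduced, for illustration, with scikit-learn by binarising the six-class labels one-vs-rest; `y_true` and `y_score` below are placeholders for the test labels and the soft-max outputs of a trained model.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y_true = np.random.randint(0, 6, size=90)           # 90 test images (placeholder)
y_score = np.random.dirichlet(np.ones(6), size=90)  # placeholder class probabilities
y_bin = label_binarize(y_true, classes=list(range(6)))

# One ROC curve and AUC per target class, one-vs-rest.
for c in range(6):
    fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")
```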

7. Conclusion

In this article, we proposed a fish species classification method based on a 32-Layer VGGNet supervised by 5 deeply controlled chunks of subnetwork hierarchy. A deep hierarchical supervision mechanism was presented by adding three additional convolutional layers with different convolutional kernels and hidden layer features, and a dropout and flatten effect was applied to each chunk of the proposed network. This approach not only targets the optimization difficulties faced by VGGNet during training but also enhances the rich feature extraction ability of the network. Comprehensive experiments confirmed that our proposed CNN model is superior to traditional methods for fish species classification. We also evaluated the proposed model under different learning rates and momentum values at higher epochs, which showed excellent identification results. In future work, we aim to apply the proposed method to fish disease identification, which may have potential benefits for the aquaculture industry.
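To make the deep-supervision idea concrete, the following minimal sketch builds a single supervised chunk with an auxiliary dropout–flatten–softmax branch. The filter count, dropout rate, input size and number of convolutional layers per chunk are illustrative assumptions, not the paper's exact configuration.

    # Sketch: one deeply supervised VGG-style chunk with an auxiliary
    # classifier branch. All hyperparameters here are illustrative.
    from tensorflow.keras import Input, Model, layers

    inputs = Input(shape=(224, 224, 3))
    x = inputs
    for _ in range(4):                               # conv layers per chunk (assumed)
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)

    branch = layers.Dropout(0.5)(x)                  # deep-supervision branch
    branch = layers.Flatten()(branch)
    aux_output = layers.Dense(6, activation="softmax",
                              name="aux_classifier")(branch)

    chunk = Model(inputs, aux_output)                # the full network stacks
                                                     # five such chunks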

Declaration of Competing Interest

The authors declare that there is no conflict of interest.

Acknowledgments

This research did not receive any grant or funding from any institution. We would like to express our gratitude to the anonymous reviewers for helping to improve this research article.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.compag.2019.105075.

References

Acharya, U.R., Fujita, H., Lih, O.S., Hagiwara, Y., Tan, J.H., Adam, M., 2017. Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network. Inform. Sci. 405, 81–90.
Alsmadi, M.K., Omar, K.B., Noah, S.A., et al., 2011. Fish classification based on robust features extraction from color signature using back-propagation classifier. J. Comput. Sci. 7, 52.
dos Santos, A.A., Gonçalves, W.N., 2019. Improving Pantanal fish species recognition through taxonomic ranks in convolutional neural networks. Ecol. Inform. 100977.
Elola, A., Aramendi, E., Irusta, U., Picón, A., Alonso, E., Owens, P., Idris, A., 2019. Deep neural networks for ECG-based pulse detection during out-of-hospital cardiac arrest. Entropy 21, 305.
Ge, Z., McCool, C., Sanderson, C., Corke, P., 2015. Modelling local deep convolutional neural network features to improve fine-grained image classification. In: 2015 IEEE International Conference on Image Processing (ICIP). IEEE, pp. 4112–4116.
Güney, S., Atasoy, A., 2015. Study of fish species discrimination via electronic nose. Comput. Electron. Agric. 119, 83–91.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hu, J., Li, D., Duan, Q., Han, Y., Chen, G., Si, X., 2012. Fish species classification by color, texture and multi-class support vector machine using computer vision. Comput. Electron. Agric. 88, 133–140.
Khan, S., Islam, N., Jan, Z., Din, I.U., Rodrigues, J.J.C., 2019. A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recogn. Lett. 125, 1–6.
Kratzert, F., Mader, H., 2018. Fish species classification in underwater video monitoring using convolutional neural networks. EarthArXiv.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Labao, A.B., Naval Jr., P.C., 2019. Cascaded deep network systems with linked ensemble components for underwater fish detection in the wild. Ecol. Inform. 52, 103–121.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324.
Lu, Y.-C., Tung, C., Kuo, Y.-F., 2019. Identifying the species of harvested tuna and billfish using deep convolutional neural networks. ICES J. Mar. Sci.
Lv, H., Xu, W., You, J., Xiong, S., 2017. Classification of freshwater fish species by linear discriminant analysis based on near infrared reflectance spectroscopy. J. Near Infrared Spectrosc. 25, 54–62.
Ogunlana, S., Olabode, O., Oluwadare, S., Iwasokun, G., 2015. Fish classification using support vector machine. Afr. J. Comput. ICT 8, 75–82.
Olsvik, E., Trinh, C.M., Knausgård, K.M., Wiklund, A., Sørdalen, T.K., Kleiven, A.R., Jiao, L., Goodwin, M., 2019. Biometric fish classification of temperate species using convolutional neural network with squeeze-and-excitation. In: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, pp. 89–101.
Oosting, T., Star, B., Barrett, J.H., Wellenreuther, M., Ritchie, P.A., Rawlence, N.J., 2019. Unlocking the potential of ancient fish DNA in the genomic era. Evol. Appl.
Piechaud, N., Hunt, C., Culverhouse, P.F., Foster, N.L., Howell, K.L., 2019. Automated identification of benthic epifauna with computer vision. Mar. Ecol. Prog. Ser. 615, 15–30.
Rathi, D., Jain, S., Indu, S., 2017. Underwater fish species classification using convolutional neural network and deep learning. In: 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR). IEEE, pp. 1–6.
Riesenhuber, M., Poggio, T., 1999. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019.
Rodrigues, M.T., Freitas, M.H., Pádua, F.L., Gomes, R.M., Carrano, E.G., 2015. Evaluating cluster detection algorithms and feature extraction techniques in automatic classification of fish species. Pattern Anal. Appl. 18, 783–797.
Salman, A., Jalal, A., Shafait, F., Mian, A., Shortis, M., Seager, J., Harvey, E., 2016. Fish species classification in unconstrained underwater environments based on deep learning. Limnol. Oceanogr. Methods 14, 570–585.
Salman, A., Siddiqui, S.A., Shafait, F., Mian, A., Shortis, M.R., Khurshid, K., Ulges, A., Schwanecke, U., 2019. Automatic fish detection in underwater videos by a deep neural network-based hybrid motion learning system. ICES J. Mar. Sci.
Salman, A., Maqbool, S., Khan, A.H., Jalal, A., Shafait, F., 2019. Real-time fish detection in complex backgrounds using probabilistic background modelling. Ecol. Inform. 51, 44–51.
Scherer, D., Müller, A., Behnke, S., 2010. Evaluation of pooling operations in convolutional architectures for object recognition. In: International Conference on Artificial Neural Networks. Springer, pp. 92–101.
Serre, T., Wolf, L., Poggio, T., 2006. Object recognition with features inspired by visual cortex. Technical Report. Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, Cambridge.
Shah, S.Z.H., Rauf, H.T., IkramUllah, M., Bukhari, S.A.C., Khalid, M.S., Farooq, M., Fatima, M., 2019. Fish-Pak: fish species dataset from Pakistan for visual features based classification. Mendeley Data v3.
Shrivakshan, G., Chandrasekar, C., 2012. A comparison of various edge detection techniques used in image processing. Int. J. Comput. Sci. Iss. (IJCSI) 9, 269.
Siddiqui, S.A., Salman, A., Malik, M.I., Shafait, F., Mian, A., Shortis, M.R., Harvey, E.S., 2017. Automatic fish species classification in underwater videos: exploiting pre-trained deep neural network models to compensate for limited labelled data. ICES J. Mar. Sci. 75, 374–389.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song, S., Que, Z., Hou, J., Du, S., Song, Y., 2019. An efficient convolutional neural network for small traffic sign detection. J. Syst. Architect.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Tan, J.H., Acharya, U.R., Bhandary, S.V., Chua, K.C., Sivaprasad, S., 2017. Segmentation of optic disc, fovea and retinal vasculature using a single convolutional neural network. J. Comput. Sci. 20, 70–79.
Tanaka, A., Tomiya, A., 2017. Detection of phase transition via convolutional neural networks. J. Phys. Soc. Jpn. 86, 063001.
Tseng, C.-H., Kuo, Y.-F., 2019. Detecting and counting harvested fish and measuring fish body lengths in video using deep learning methods. In: 2019 ASABE Annual International Meeting. American Society of Agricultural and Biological Engineers, p. 1.
Xu, L., Bennamoun, M., An, S., Sohel, F., Boussaid, F., 2019. Deep learning for marine species recognition. In: Handbook of Deep Learning Applications. Springer, pp. 129–145.
Zhongzhi, H., 2019. Computer Vision-Based Agriculture Engineering. CRC Press.
Zhou, C., Xu, D., Chen, L., Zhang, S., Sun, C., Yang, X., Wang, Y., 2019. Evaluation of fish feeding intensity in aquaculture using a convolutional neural network and machine vision. Aquaculture 507, 457–465.