TA-CNN: Two-way Attention Models in Deep Convolutional Neural Network for Plant Recognition
Youxiang Zhu^a, Weiming Sun^a, Xiangying Cao^a, Chunyan Wang^a, Dongyang Wu^a, Yin Yang^b, Ning Ye^a,∗
a College of Information Science and Technology, Nanjing Forestry University, No.159 Longpan Road, Nanjing, 210037, PR China
b Department of Electrical and Computer Engineering, The University of New Mexico, Albuquerque, NM 87131, USA
Abstract
Automatic plant recognition using AI is a challenging problem. In addition to the recognition of plant specimens, we also want to recognize the plant type in its actual living environment, which is more difficult because of the background noise. In this paper, we propose a novel method, referred to as the two-way attention model, using a deep convolutional neural network. As the name implies, it has two ways of attention. The first is the family first attention, which is based on the standard plant taxonomy and aims to recognize the plant's family. Specifically, we create plant family labels as another objective of the learning under the multi-task learning framework. To deal with conflicting predictions of family and species labels, we propose an implicit tree model with a dedicated loss function to maintain the correspondence between family and species labels. The second is the max-sum attention, which focuses on the discriminative features of the input image by finding the max-sum part of the fully convolutional network heat map. Because these two ways of attention are compatible, we combine both discriminative feature learning and part based attention. The experiments on four challenging datasets (i.e., Malayakew, ICL, Flowers 102 and CFH plant) confirm the effectiveness of our method: the recognition accuracy over those four datasets reaches 99.8%, 99.9%, 97.2% and 79.5%, respectively.
∗Corresponding author. Email address: [email protected] (Ning Ye)
Figure 1: Examples of three kinds of plants, which look similar and are hard to tell apart. They belong to the same family (Malvaceae) according to modern plant taxonomy.
Keywords: Convolutional neural network, Plant recognition, Attention Model
1. Introduction
For a given image of a plant, one would often like to infer the name of its species. This may be trivial in some cases. However, as shown in Figure 1, some plants highly resemble each other, making it difficult to correctly identify their respective species even for experienced botanists. Modern plant categorization is based on the plant taxonomy. It divides plants into hierarchical classes, which can be queried recursively according to discriminative plant features. Some of those features are associated with local (and thus tiny) morphological patterns of the plant. Naturally, a close observation typically leads to a better result. As shown in Figure 2, plant recognition typically has two tasks: one is recognition based on a clean plant specimen image, and the other is recognition of a plant image taken in its living environment. Unlike general object recognition such as ImageNet [1], which only needs to output a high-level object label of the input image, plant recognition needs to identify subtle differences among fine-grained classes. Because of the background noise
Figure 2: There are two kinds of plant recognition tasks: the specimen recognition (left) and the real-environment recognition (right).
in the real-world environment, the recognition based on real plant images is more difficult. There exists a large volume of traditional machine learning methods for plant recognition, which focus on plant shape [2, 3, 4, 5], texture [6, 7, 8, 9], venation features [10, 11], or consider them jointly [12, 13]. However, with the latest advances in deep learning techniques, these methods have been outperformed and become outdated [14]. Instead of relying on handcrafted features, deep convolutional neural networks (CNNs) extract features automatically in an end-to-end manner. While CNNs are skilled at learning features that may even be ignored by humans, they may also suffer from the overfitting issue. To address this problem, we introduce the plant family labels as an auxiliary supervised learning objective during the training. Also, with unsupervised object localization, we extract a subregion of the plant of interest to relieve the influence of the background noise. Our technical contributions can be summarized as follows:
1. We verify that state-of-the-art CNNs have extraordinary performance when applied to specimen recognition tasks, with accuracy above 99%.
2. We propose a novel method called the family first model based on modern botany taxonomy, which extracts discriminative features in the multi-task learning framework.
3. We construct max-sum attention CNNs by slightly modifying the target network, which can detect discriminative regions for image classification.
4. We combine the above two methods into a novel two-way attention model, which is tested on four datasets (Malayakew, ICL, Flowers 102 and CFH plant) and achieves state-of-the-art performance on all of them.

2. Related work
There are many works on plant recognition using deep learning based methods. S. H. Lee et al. [15] applied deep learning to plant identification and proposed a new hybrid model that exploits the correspondence between different kinds of contextual information about leaf features. G. L. Grinblat et al. [16] reported a successful application of deep learning to plant identification from leaf vein patterns. P. Barré et al. [17] developed a CNN-based plant identification system called LeafNet. However, these works mainly focus on the specimen recognition task; the real-environment recognition task, which is more challenging, still remains to be solved. Fortunately, some studies on fine-grained image recognition in similar domains (e.g., animals, cars, food) can be a great help in solving the plant recognition problem in the real-world environment. These methods fall into two classes: discriminative feature learning and part based attention.
Discriminative feature learning Discriminative feature learning aims to learn stronger feature representations. T.-Y. Lin et al. [18] proposed a bilinear CNN structure and achieved state-of-the-art performance in bird recognition [19]. X. Zhang et al. [20] unified CNNs with Fisher Vectors [21] and improved the classification results on dogs [22] and birds [19]. F. Zhou et al. [23] exploited rich class relationships by using bipartite-graph labels and obtained good performance on a food dataset.
Part based attention Part based attention aims to locate the discriminative part of the input image. Some early strongly supervised methods [24, 25, 26] need extra bounding-box and part annotations, which makes them hard to generalize. To solve this problem, numerous weakly supervised methods without manual labeling have been proposed. X. Liu et al. [27] proposed a Fully Convolutional Attention Network for fine-grained recognition and trained it with a reinforcement learning strategy. G.-S. Xie
et al. [28] proposed LG-CNN, which goes from local parts to global discrimination and achieves 96.6% test accuracy on the Flower 102 dataset [29]. J. Fu et al. [30] proposed a recurrent attention convolutional neural network that recursively learns discriminative region attention and region-based feature representations at multiple scales. Different from the methods above, our model combines both discriminative feature learning and part based attention. For discriminative feature learning, we introduce the family labels as another objective to learn; for part based attention, we propose the max-sum attention model to find the discriminative part of the input image. Also, note that different from [23], which needs extra work to label the bipartite graph, our method can obtain the family labels nearly cost-free. Finally, we combine the two methods into the two-way attention model. The details are described in Section 3.
In this section, we describe the proposed two-way attention model, which is compatible with most popular network architectures like AlexNet [31], VGG16 [32], ResNet [33], and Xception [34]. Our model has two ways of attention: one focuses on plant family type, and the other focuses on the discriminative part of the input image.
3.1. Cross-entropy loss
Given a set of m input images {X_1, X_2, . . . , X_m} and their ground truth labels {y_1, y_2, . . . , y_m}, which belong to k classes, we input the i-th image X_i into a target network. The forward pass of the neural network is denoted as f(X_i) = W ∗ X_i, where ∗ is a set of convolution, pooling, and activation operations, and W encodes all the network parameters. W contains two parts, W_c and W_f, which are the parameters of the convolution layers and fully connected layers, respectively. The target network is trained by minimizing the cross-entropy loss, defined as:

L = \sum_{i=1}^{m} -\log(p_i),    (1)

where p_i is the softmax score, defined as:

p_i = \frac{e^{f_u}}{\sum_{j=1}^{k} e^{f_j}}.    (2)
Here, X_i belongs to the u-th class, and f_u is the score of the corresponding class.
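For reference, Eqs. (1)-(2) amount to the standard softmax cross-entropy. A minimal NumPy sketch for a single sample (the function name is ours, for illustration only) is:

```python
import numpy as np

def softmax_cross_entropy(scores, true_class):
    """Eqs. (1)-(2) for one sample: softmax over the k class scores f_j,
    then the negative log-probability of the ground-truth class u."""
    exp = np.exp(scores - scores.max())  # shift scores for numerical stability
    p_u = exp[true_class] / exp.sum()
    return -np.log(p_u)
```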
3.2. Family First Model
Fine-grained plant recognition is challenging and often requires professional domain knowledge. Human experts use plant keys to identify the plant category. Plant keys, also called dichotomous plant keys, were first introduced by Jean Baptiste Lamarck in 1778 [35], and offer great help to botanists for plant identification. Each key contains several dichotomous features, which can be interpreted as a binary search tree from a computer science point of view. However, such hand-designed features are difficult to implement in computer vision systems. Alternatively, we resort to the plant taxonomy, a hierarchical categorization including, from top to bottom, kingdom, division, class, order, family, genus, and species. In the view of botanists and plant keys, the family is the most discriminative feature container. That is, when human experts encounter an unknown plant, they first tell which family the plant belongs to, and then identify the species. This divide-and-conquer strategy lowers the difficulty of plant recognition, which is the primary rationale for designing the family first model.
To implement the family first model, one intuitive approach is to use a tree-based structure like CNN Tree [36]. Instead of seeking the confusion set, we can branch the tree according to the family labels. However, such CNN data structures consume a huge amount of memory. For instance, the Flower 102 [29] dataset has 102 species belonging to 47 families, meaning we would need 48 CNNs for classifying a single dataset! Therefore, we need to find a way to reduce the memory footprint. To this end, we employ multi-task learning [37]. Specifically, we take the plant family labels as another objective for the target network to learn. There are four approaches based on the multi-task learning framework, illustrated in Figures 3 and 4. The first approach is to train two target networks separately: one predicts family labels, and the other one predicts species labels (shown in Figure 3a). We refer to this strategy as separate training (ST), and its final prediction score is computed as:

result = \max_i \left( \alpha\, p^f_i + p^s_i \right),    (3)

where p^f_i is the converted family softmax score, p^s_i is the species softmax score, and α is the family weight rate. Note that it is invalid to sum a family score with a species score that does not belong to this family.
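To make the ST fusion of Eq. (3) concrete, the following sketch computes the fused score; the array names, the family_of_species mapping, and the default α are placeholders for illustration, not code taken from our implementation.

```python
import numpy as np

def st_fusion(p_species, p_family, family_of_species, alpha=0.5):
    """Separate-training fusion (Eq. 3).

    p_species        : (k,) species softmax scores
    p_family         : (n,) family softmax scores
    family_of_species: (k,) index of the family each species belongs to
    alpha            : family weight rate
    """
    # Convert family scores to species space: each species inherits the score
    # of its own family, so a family score is never added to a species score
    # from a different family.
    p_family_converted = p_family[family_of_species]   # shape (k,)
    fused = alpha * p_family_converted + p_species
    return int(np.argmax(fused))                        # predicted species index
```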
(a) Separate training    (b) Joint training    (c) Sequential joint training
Figure 3: An overview of the family first models. "conv" stands for the convolutional layers; "fc" stands for the fully connected layers. Separate training means training two networks separately and fusing them at test time. Joint training means training on the two labels simultaneously. Sequential joint training not only trains on the two labels at the same time but also outputs the family labels at an earlier stage.
In order to further reduce the number of CNNs, we deploy two separate fully connected layers before the network output, which is called joint training (JT, Figure 3b). In this case, we only need one CNN with two softmax outputs. In the forward propagation, the target network predicts both species and family labels, and in the back-propagation step, it optimizes the network parameters W by minimizing the categorical cross-entropy loss of Eq. (1) for both the family and species labels. Under this setting, the two learning tasks can help each other, because some features that are easily learnt for family prediction may be difficult to learn for species prediction. The extra supervised label may also prevent the network from overfitting. The third approach is inspired by GoogLeNet [38] and is called sequential joint training (SJT, Figure 3c). Under this setting, we output the family labels first as an auxiliary output of the network, and predict the species labels at the last layer of the network. This approach is reasonable because the family is a more coarse-grained label than the species and thus needs fewer parameters to be classified.
However, all three of these methods share a common drawback: the network may output a family label y_f and a species label y_s such that y_s ∉ y_f. In other words, the family label and the species label may not match each other. For instance, y_s may be the rose while y_f is not Rosaceae. This is because the parent-child relation between the family and species labels is discarded during CNN training. Therefore, we need to restore this implicit connection between a family and its corresponding species labels, which is handled by the implicit tree model (IT). An overview of the IT model is shown in Figure 4. This network structure has a single output ŷ_f for the family label and n outputs ŷ_s^1, . . . , ŷ_s^n for the species labels, where n is the number of families. Each of the n species outputs only contains the species that belong to its family. In the forward propagation step, the model first picks the family label with the highest prediction score, and then picks the best species label within that family. In the back-propagation step, the softmax layer that predicts family labels uses the standard cross-entropy loss of Eq. (1), while the remaining output layers use a modified loss:

L = \sum_{i=1}^{m} -\lambda \log(p_i),    (4)
Figure 4: An overview of the implicit tree model. It has n + 1 outputs, where n is the number of plant families. Each of the n species outputs has m dimensions, where m is the number of species belonging to that family.
where λ is the implicit tree multiplier, computed per training sample as:

\lambda = \begin{cases} 1, & \sum_{j=1}^{k} y_j = 1 \\ 0, & \sum_{j=1}^{k} y_j = 0 \end{cases}    (5)
That is, when the ground truth label belongs to the output's family, we have the standard cross-entropy loss; otherwise, the loss is zero, meaning we simply ignore that output. In addition, we also balance the training set so that each family has a roughly equal number of samples. This is done by dividing a huge family (i.e., a plant family with a large number of samples) into sub-families based on its genus labels.
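A minimal sketch of the masked loss in Eqs. (4)-(5) for a single species head is given below, assuming a tf.keras setup and one-hot labels that are all-zero when the sample's species does not belong to that head's family; the function name and the small epsilon are our own choices.

```python
import tensorflow as tf

def implicit_tree_loss(y_true, y_pred):
    """Masked cross-entropy for one species head of the implicit tree model.
    lambda = 1 when the one-hot label contains a 1 (in-family sample),
    lambda = 0 when the label vector is all zeros (out-of-family sample)."""
    lam = tf.reduce_sum(y_true, axis=-1)                                # (batch,)
    ce = -tf.reduce_sum(y_true * tf.math.log(y_pred + 1e-7), axis=-1)   # (batch,)
    return lam * ce
```

Each of the n species heads would be compiled with this loss, while the family head keeps the standard categorical cross-entropy of Eq. (1).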
3.3. Max-sum Attention Model
Unlike specimen pictures, in real-world applications a plant picture contains not only the visual information of the target plant but also noisy pixels coming from its living environment (see Figure 2, right column). This unrelated information can seriously degrade the prediction accuracy. To address this problem, we propose a max-sum attention model, as shown in Figure 5.
Figure 5: An overview of the max-sum attention model pipeline, which consists of two steps: attention heatmap generation (green arrow) and max-sum part localization (yellow arrow).
This model is quite intuitive and consists of two steps: we first obtain an attention heatmap, and then extract the plant of interest from the original input.
3.3.1. Attention heat map generation
First, we train a network with parameters pre-trained on ImageNet [1]. Then, we utilize the fully convolutional network [39, 40] to obtain the attention heat map. Specifically, we modify the target network by replacing the fully connected layers with 1 × 1 convolution layers while keeping the original parameters. We also remove the last pooling layer of the target network. We take the average of the attention heat maps of the top-5 prediction results as the final output. Formally, the network output f_o for each input image X_i is computed as follows:
f_o(X_i) = softmax(fc(pooling_l(conv_n(X_i)))),    (6)
where conv_n stands for all the convolutional, pooling, and activation layers before the last pooling layer, pooling_l is the last pooling layer, fc denotes the fully connected layers, and softmax is the softmax activation. With our modification, the computation can be expressed as follows:

f_m(X_i^o) = HM_merge(softmax(conv_{1×1}(conv_n(X_i^o)))),    (7)
where X_i^o indicates that we keep the original size of the i-th input image, and conv_{1×1} stands for the one-by-one convolution layer. Note that, in our implementation, the softmax layer here processes a 2D heatmap rather than a 1D probability vector, and the HM_merge operation is:
Figure 6: An overview of the max-sum attention model. First, we train the upper stream and convert the network into the middle stream, where all the 1 × 1 convolutional layers share the same parameters with the fc layers. Finally, following the lower stream, the attention heatmap is generated and used to extract the target plant of interest.
ids = idmax5(f_o(X_i)),    out = mean_{ids}(f_m(X_i^o)).    (8)
That is, we first find the indices of the top-5 predictions of the original network with the idmax5() function, and then take the average attention heat map over those indices as the final output. Because the parameters of f_m are all inherited from f_o, we do not train f_m again. Other implementation details are discussed in Section 4.
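A rough sketch of Eqs. (6)-(8) is shown below, assuming a TensorFlow 2 / Keras backbone whose last pooling layer is global average pooling (as in Xception); feature_extractor, dense_layer and the helper name are placeholders for the pieces described above, not released code.

```python
import tensorflow as tf

def top5_attention_heatmap(feature_extractor, dense_layer, image):
    """feature_extractor: backbone up to (excluding) the last pooling layer.
    dense_layer: trained fc classification layer, reused as a 1x1 convolution.
    image: one input image at its original size, shape (H, W, 3)."""
    feats = feature_extractor(image[None, ...])                # (1, h, w, c)
    w, b = dense_layer.get_weights()                           # (c, k), (k,)
    # fc weights reused as a 1x1 convolution: per-location class scores.
    score_map = tf.nn.conv2d(feats, w[None, None, :, :],
                             strides=1, padding="SAME") + b    # (1, h, w, k)
    prob_map = tf.nn.softmax(score_map, axis=-1)               # 2D softmax
    # Original prediction f_o: global average pooling is equivalent to
    # averaging the linear score map over all locations.
    probs = tf.nn.softmax(tf.reduce_mean(score_map, axis=[1, 2]))[0]
    top5 = tf.argsort(probs, direction="DESCENDING")[:5]
    # HM_merge: average the heatmaps of the top-5 predicted classes.
    heatmap = tf.reduce_mean(tf.gather(prob_map[0], top5, axis=-1), axis=-1)
    return heatmap.numpy()                                     # (h, w)
```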
3.3.2. Max-sum part localization
To localize the discriminative part of the input image, we can use either static or dynamic localization. We take the output matrix from the previous step as input and output a four-variable tuple (a_1, a_2, b_1, b_2), which denotes the coordinates of the extracted subregion. Finally, we map these coordinates back to the original input image and crop the corresponding part of the image.
Static localization Static localization finds a max-sum submatrix of a fixed size. We implement it using a dynamic programming strategy [41] with O(n^2) time complexity. The advantage of this method is that it can be
computed efficiently and has a stable output. However, it requires setting the crop rate r manually, which indicates the percentage of the original image to be extracted.
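The following sketch illustrates the idea of the fixed-size search; it uses a summed-area table to evaluate every window of the given crop rate, which is one possible realization and not necessarily the exact dynamic-programming variant of [41].

```python
import numpy as np

def static_max_sum_crop(heatmap, crop_rate=0.7):
    """Find the fixed-size window of the heatmap with the largest sum.
    Returns (a1, a2, b1, b2): row and column bounds of the best window."""
    h, w = heatmap.shape
    wh, ww = max(1, int(h * crop_rate)), max(1, int(w * crop_rate))
    # Summed-area table with a zero border makes each window sum O(1).
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = np.cumsum(np.cumsum(heatmap, axis=0), axis=1)
    best, best_pos = -np.inf, (0, 0)
    for i in range(h - wh + 1):
        for j in range(w - ww + 1):
            s = (sat[i + wh, j + ww] - sat[i, j + ww]
                 - sat[i + wh, j] + sat[i, j])
            if s > best:
                best, best_pos = s, (i, j)
    a1, b1 = best_pos
    return a1, a1 + wh, b1, b1 + ww
```

The returned heatmap coordinates are then scaled back to the original image resolution before cropping.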
Dynamic localization Dynamic localization finds a max-sum submatrix of varying size. First, we preprocess the input matrix M by M ← M − β(median(M) + mean(M)). Here median() returns the median value of the matrix, and mean() calculates the average of all matrix elements. β is a hyperparameter related to the cropped size; in most cases, we set it to 0.3. After that, we use the Fast Fourier transform (FFT) based method proposed by Tristan Hearn [42] with O(n^3 log n) time complexity. After identifying the discriminative part of the input image, we feed it to another network and train that network again. For a static-size localization input, we denote it as "static", and for a dynamic-size localization input, we denote it as "dynamic". It can also be a multi-input network that takes the original image (denoted as "original") together with the dynamic-size localization, or other input combinations such as "original" + "static" and "static" + "dynamic".
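A simplified sketch of the dynamic variant is given below; it performs the same preprocessing and then uses a plain 2D Kadane scan to find the best submatrix of arbitrary size, as a stand-in for the FFT-based routine of [42].

```python
import numpy as np

def dynamic_max_sum_crop(heatmap, beta=0.3):
    """Shift the heatmap so that background becomes negative, then find the
    max-sum submatrix of arbitrary size. Returns (a1, a2, b1, b2)."""
    m = heatmap - beta * (np.median(heatmap) + np.mean(heatmap))
    h, w = m.shape
    best, bounds = -np.inf, (0, h, 0, w)
    for top in range(h):
        col_sums = np.zeros(w)
        for bottom in range(top, h):
            col_sums += m[bottom]
            # 1D Kadane over the column sums of rows top..bottom
            cur, start = 0.0, 0
            for j in range(w):
                if cur <= 0:
                    cur, start = col_sums[j], j
                else:
                    cur += col_sums[j]
                if cur > best:
                    best, bounds = cur, (top, bottom + 1, start, j + 1)
    return bounds
```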
3.4. Two-way Attention Model
Since the family first model and the max-sum attention model are compatible with each other, we combine them into the two-way attention model (shown in Figure 7). We train the first network with both family and species labels, and then use it to generate the attention heat maps. After cropping the images, we feed them into the second network for training. In the test phase, we use the first network for localization and take the output of the second network as the final result.

4. Experiments
We conducted experiments on four datasets: Malayakew [15], ICL [43], Flowers 102 [29] and CFH plant [44]. The first two are for specimen recognition, and the other two are for real-environment recognition. We use Xception [34] as our backbone network, which achieves a top-1 validation accuracy of 0.790 and a top-5 validation accuracy of 0.945 on ImageNet [1]. Our implementation is based on the Keras [45] library with the TensorFlow [46] backend.
Figure 7: An overview of the pipeline of the proposed two-way attention model. There are two streams of the backbone networks. Each network outputs both family labels and species labels. The first network is also responsible for generating heatmaps which guide the image cropping.
Table 1: The statistics of datasets used in this paper
Datasets           # Classes   # Training   # Testing
Malayakew [15]     44          2,288        528
ICL [43]           220         13,535       3,316
Flower 102 [29]    102         2,040        6,149
CFH Plant          107         11,914       2,906
4.1. Parameter settings
In our experiments, we use an 8× rotation data augmentation strategy that rotates each training image by every 45° from 0° to 360°. We use RMSprop [47] for the optimization, with the learning rate set to 0.0001 and a decay rate of 0.0007 per epoch. We re-scale the input pixel values to the range [0, 255] for better classification and part localization. Table 1 summarizes the number of training and testing images.
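For concreteness, the augmentation and optimizer settings could look as follows; this sketch assumes tf.keras and SciPy, rotate_8x is our helper name, and the multiplicative per-epoch decay formula is an assumption since only the rate is stated above.

```python
import numpy as np
from scipy.ndimage import rotate
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import RMSprop

def rotate_8x(images, labels):
    """8x data augmentation: every 45-degree rotation of each training image."""
    aug_x, aug_y = [], []
    for img, lab in zip(images, labels):
        for angle in range(0, 360, 45):
            aug_x.append(rotate(img, angle, reshape=False, mode="nearest"))
            aug_y.append(lab)
    return np.stack(aug_x), np.stack(aug_y)

optimizer = RMSprop(learning_rate=1e-4)
# Assumed per-epoch multiplicative decay of 0.0007.
lr_schedule = LearningRateScheduler(lambda epoch, lr: lr * (1.0 - 7e-4))
```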
4.2. Implementation details
Xception [34] has 131 layers before the last pooling layer. We keep these layers unchanged in our implementation and use them as the backbone network. The detailed implementation is shown in Figure 8. The network on the left is for training and prediction on both original and cropped images, and the network on the right is for attention heatmap generation. The left network takes fixed-size images as input and outputs both family and species labels. The right network takes the input images at their original size and outputs the attention heatmap. Both networks share the same parameters.
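A minimal sketch of the left, two-output network in tf.keras is shown below; the head sizes, the 299 × 299 input, and the layer names are illustrative, and the actual head configuration follows Figure 8.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

def build_two_output_net(num_families, num_species, input_size=299):
    """ImageNet-pretrained Xception backbone with two softmax heads."""
    inp = Input(shape=(input_size, input_size, 3))
    backbone = Xception(include_top=False, weights="imagenet")
    feats = GlobalAveragePooling2D()(backbone(inp))
    family_out = Dense(num_families, activation="softmax", name="family")(feats)
    species_out = Dense(num_species, activation="softmax", name="species")(feats)
    model = Model(inp, [family_out, species_out])
    model.compile(optimizer="rmsprop",
                  loss={"family": "categorical_crossentropy",
                        "species": "categorical_crossentropy"})
    return model
```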
4.3. Malayakew and ICL datasets
Malayakew [15] and ICL [43] are both specimen recognition datasets. With the advance of CNNs, specimen recognition has become a relatively easy task, and the baseline model can already achieve an accuracy of 99%, so we only report the baseline experimental results without our optimizations. The results are shown in Table 2 and Table 3. Note that, due to the lack of a standard train/test split protocol for the ICL [43] dataset, we randomly select 80% of the image samples for training and use the remaining 20% for testing. Also, we did not use any data augmentation strategy for training on the ICL [43] dataset. For an objective comparison with existing works such as [48], we also report the result with a train/test split of 90%/10%.
Figure 8: An overview of the implementation details of our two-way attention CNN with the Xception backbone, based on the Flower 102 setting. "None" here means the dimension can be of any size, depending on the input.

Table 2: Comparisons of classification accuracy with other methods on the Malayakew dataset.
Methods                       Acc (%)
Fine-tuned AlexNet [15]       99.5
Fine-tuned Xception (ours)    99.8
Table 3: Comparisons of classification accuracy with other methods on the ICL dataset. Our method yields better results even with more classes and less training data.
Methods                       Classes   Train/test rate   Acc (%)
MDM-CD-C [43]                 50        96.6% / 3.3%      74.2
I-IDSC [49]                   50        96.6% / 3.3%      92.2
PID-DBNs [48]                 220       90% / 10%         93.9
Fine-tuned Xception (ours)    220       80% / 20%         99.7
Fine-tuned Xception (ours)    220       90% / 10%         99.9
Table 4: Test accuracy of TA-CNNs on Flower 102 dataset.
Methods                       Acc (%)
Baseline                      96.0
ST                            96.1
JT                            96.4
SJT                           96.5
IT                            96.2
Static 0.7                    96.3
Dynamic                       96.4
Dynamic + Original            96.9
Dynamic + Static 0.7          96.9
Dynamic + Static 0.8          97.0
Dynamic + JT                  96.2
Static 0.7 + JT               96.5
Dynamic + Original + JT       96.9
Dynamic + Static 0.8 + JT     97.0
Dynamic + Original + SJT      97.1
Dynamic + Static 0.8 + SJT    97.2
4.4. Flower 102 dataset
Flower 102 [29] is a real-environment based dataset, which consists of 102 flower classes with 8,189 images in total. According to the standard split protocol, there are 2,040 images for training and 6,149 images for testing. Under this setting, the training set is smaller than the test set, which makes the task more challenging. The results and comparisons with other methods are shown in Table 4 and Table 5. The baseline test accuracy of Xception is 96.0%, which is already a decent result, yet our model is able to further improve the prediction accuracy. For the family first models, separate training has a 0.1% improvement over the baseline model, while JT and SJT have better results with 0.4% and 0.5% improvements, respectively. For the max-sum attention models, dynamic crop and static crop with r = 0.7 have 0.4% and 0.3% improvements over the baseline model. Also, with proper combinations of the original images, the dynamic crop images, and the static crop images, we can achieve an accuracy of 97.0%. Moreover, since the family first model and the max-sum attention model are compatible with each other, we obtain 97.2% test accuracy when combining both, which outperforms all the existing methods.

4.5. CFH plant dataset
CFH plant [44] is an online real-environment based dataset which contains more than 6 million labeled plant images belonging to more than 10 thousand classes. To evaluate our model, we downloaded 14,820 images covering the 107
Table 5: Comparisons of classification accuracy and the total number of parameters with other methods on the Flower 102 dataset.
Methods                                  Acc (%)   Total Parameters
Detection and Segmentation [50]          80.7      -
GMP [51]                                 84.6      -
CNN feature [52]                         86.8      -
Neural Activation Constellations [53]    95.3      139,988,134
LG-CNN [28]                              96.6      269,356,876
TA-CNN (ours)                            97.2      46,025,164
Table 6: Accuracy and standard deviations of 10-fold cross-validation of TA-CNNs on CFH Plant dataset.
Methods                        Acc and Std (%)
Baseline                       75.6 ± 0.5
Dynamic + Original             79.2 ± 1.0
Dynamic + Static 0.7           77.1 ± 0.4
Static 0.7 + Original          78.6 ± 0.4
Dynamic + Original + JT        79.5 ± 0.9
Dynamic + Static 0.7 + JT      77.6 ± 0.9
Static 0.7 + Original + JT     79.4 ± 0.6
Dynamic + Original + SJT       79.2 ± 0.8
Dynamic + Static 0.7 + SJT     77.6 ± 0.7
Static 0.7 + Original + SJT    79.4 ± 0.8
most common plant classes in China. Due to the large amount of data and limited computational resources, we did not use any data augmentation method on this dataset. There is also some random noise in the dataset (shown in Figure 9). Since there is no standard train/test split protocol, 10-fold cross-validation is used on this dataset. The results are reported in Table 6. The baseline test accuracy of Xception is 75.6% because of the complex backgrounds. Our models have better prediction accuracy, about 2% to 4% higher than the baseline model, with the best accuracy of 79.5% achieved by the "Dynamic + Original + JT" combination.
Figure 9: Sample images in the CFH plant dataset. Note that there is some random noise in this dataset.
Figure 10: The max-sum localization results with dynamic size. The upper row shows the original images; the middle row shows the localization without the family labels; the remaining rows show the localization with the family labels.
5. Discussion
5.1. Influence of the family labels on max-sum localization
As we know, the family labels are defined by botanists and contain prior knowledge of botany. This serves as a good guide for us to find more relevant plant features; without this information, we can only use the CNNs to find plant features by brute force. As shown in Figure 10, localization both with and without the family labels performs well. In most cases, when training without the family labels, the network only pays attention to the flower. However, when training with the family labels, the network also pays attention to the leaves. It can be concluded that, with the prior knowledge of the family labels, the CNN can better identify biological features of the plants beyond the more intuitive visual features, which leads to better performance in classification tasks.
6. Conclusion
In this paper, we propose a novel method for plant recognition. For the specimen recognition task, we report a successful application of deep convolutional neural networks with test accuracy surpassing 99%. For real-environment recognition, we propose the two-way attention model for further optimization in both discriminative feature learning and part based attention. For discriminative feature learning, we propose the family first models and introduce the plant family labels as another objective to learn. For part based attention, we propose the max-sum attention models to find the discriminative part of the input image. Experiments on four datasets (Malayakew, ICL,
Flowers 102 and CFH plant) demonstrate the effectiveness of our methods. In the future, we will consider converting our models into an end-to-end model, further improving the test accuracy, and reducing the time and space complexity.
AN US
This study was supported by the National Key Research and Development Plan of China (2016YFD0600101), the Jiangsu Provincial Department of Housing and Urban-Rural Development (2016ZD44), and in part by the Practice Innovation Training Program Projects for Jiangsu College Students under Grants 201710298029Z and 201810298052Z.

References
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, in: CVPR09, 2009.
M
[2] J. C. Neto, G. E. Meyer, D. D. Jones, A. K. Samal, Plant species identification using elliptic fourier leaf shape analysis, Computers and electronics in agriculture 50 (2) (2006) 121–134.
ED
[3] A. Aakif, M. F. Khan, Automatic classification of plants based on their leaves, Biosystems Engineering 139 (2015) 66–75.
CE
PT
[4] S. Mouine, I. Yahiaoui, A. Verroust-Blondet, Advanced shape context for plant species identification using leaf image retrieval, in: Proceedings of the 2nd ACM international conference on multimedia retrieval, ACM, 2012, p. 49.
AC
[5] D. Hall, C. McCool, F. Dayoub, N. Sunderhauf, B. Upcroft, Evaluation of features for leaf classification in challenging conditions, in: Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, IEEE, 2015, pp. 797–804. [6] M. Rashad, B. El-Desouky, M. S. Khawasik, Plants images classification based on textural features using combined classifier, International Journal of Computer Science and Information Technology 3 (4) (2011) 93–100. 20
ACCEPTED MANUSCRIPT
CR IP T
[7] A. Olsen, S. Han, B. Calvert, P. Ridd, O. Kenny, In situ leaf classification using histograms of oriented gradients, in: Digital Image Computing: Techniques and Applications (DICTA), 2015 International Conference on, IEEE, 2015, pp. 1–8. [8] Y. Naresh, H. Nagendraswamy, Classification of medicinal plants: an approach using modified lbp with symbolic representation, Neurocomputing 173 (2016) 1789–1797.
AN US
[9] Z. Tang, Y. Su, M. J. Er, F. Qi, L. Zhang, J. Zhou, A local binary pattern based texture descriptors for classification of tea leaves, Neurocomputing 168 (2015) 1011–1023. [10] J. Charters, Z. Wang, Z. Chi, A. C. Tsoi, D. D. Feng, Eagle: a novel descriptor for identifying plant species using leaf lamina vascular features, in: Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on, IEEE, 2014, pp. 1–6.
M
[11] M. G. Larese, R. Namías, R. M. Craviotto, M. R. Arango, C. Gallo, P. M. Granitto, Automatic classification of legumes using leaf vein image features, Pattern Recognition 47 (1) (2014) 158–168.
ED
[12] J. Chaki, R. Parekh, S. Bhattacharya, Plant leaf recognition using texture and shape features with neural classifiers, Pattern Recognition Letters 58 (2015) 61–68.
PT
[13] T. Beghin, J. S. Cope, P. Remagnino, S. Barman, Shape and texture based plant leaf classification, in: International Conference on Advanced Concepts for Intelligent Vision Systems, Springer, 2010, pp. 345–353.
CE
[14] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
AC
[15] S. H. Lee, C. S. Chan, S. J. Mayo, P. Remagnino, How deep learning extracts and learns leaf features for plant classification, Pattern Recognition 71 (2017) 1–13. [16] G. L. Grinblat, L. C. Uzal, M. G. Larese, P. M. Granitto, Deep learning for plant identification using vein morphological patterns, Computers and Electronics in Agriculture 127 (2016) 418–424. 21
ACCEPTED MANUSCRIPT
[17] P. Barré, B. C. Stöver, K. F. Müller, V. Steinhage, LeafNet: A computer vision system for automatic plant species identification, Ecological Informatics.
CR IP T
[18] T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for finegrained visual recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457. [19] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-ucsd birds 200.
AN US
[20] X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses for fine-grained image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1134–1142. [21] F. Perronnin, D. Larlus, Fisher vectors meet neural networks: A hybrid classification architecture, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3743–3752.
ED
M
[22] A. Khosla, N. Jayadevaprakash, B. Yao, F.-F. Li, Novel dataset for finegrained image categorization: Stanford dogs, in: Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Vol. 2, 2011, p. 1.
PT
[23] F. Zhou, Y. Lin, Fine-grained image classification by exploring bipartitegraph labels, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1124–1133.
CE
[24] D. Lin, X. Shen, C. Lu, J. Jia, Deep lac: Deep localization, alignment and classification for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1666–1674.
AC
[25] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, D. Metaxas, Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1143–1152. [26] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained category detection, in: European conference on computer vision, Springer, 2014, pp. 834–849. 22
ACCEPTED MANUSCRIPT
[27] X. Liu, T. Xia, J. Wang, Y. Lin, Fully convolutional attention localization networks: Efficient attention localization for fine-grained recognition, arXiv preprint arXiv:1603.06765.
CR IP T
[28] G.-S. Xie, X.-Y. Zhang, W. Yang, M.-L. Xu, S. Yan, C.-L. Liu, Lg-cnn: From local parts to global discrimination for fine-grained recognition, Pattern Recognition.
AN US
[29] M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, IEEE, 2008, pp. 722–729. [30] J. Fu, H. Zheng, T. Mei, Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition, in: Conf. on Computer Vision and Pattern Recognition, 2017.
M
[31] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
ED
[32] K. Simonyan, A. Zisserman, Very deep convolutional networks for largescale image recognition, arXiv preprint arXiv:1409.1556. [33] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385.
PT
[34] F. Chollet, Xception: Deep learning with depthwise separable convolutions, arXiv preprint arXiv:1610.02357.
CE
[35] L. R. Griffing, Who invented the dichotomous key? Richard Waller's watercolors of the herbs of Britain, American Journal of Botany 98 (12) (2011) 1911–23.
AC
[36] Z. Wang, X. Wang, G. Wang, Learning fine-grained features via a cnn tree for large-scale classification, Neurocomputing. [37] R. Caruana, Multitask learning, in: Learning to learn, Springer, 1998, pp. 95–133.
23
ACCEPTED MANUSCRIPT
CR IP T
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. [39] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: Integrated recognition, localization and detection using convolutional networks, arXiv preprint arXiv:1312.6229.
AN US
[40] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. [41] R. Bellman, The theory of dynamic programming, Tech. rep., RAND CORP SANTA MONICA CA (1954). [42] T. Hearn, Maximum submatrix sum, https://github.com/thearn/maximum-submatrix-sum (2013).
M
[43] R. Hu, W. Jia, H. Ling, D. Huang, Multiscale distance matrix for fast plant leaf recognition, IEEE transactions on image processing 21 (11) (2012) 4667–4672.
ED
[44] B. Chen, Nature-museum biodiversity information system, http://www.cfh.ac.cn/default-en.html (2012).
PT
[45] F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
AC
CE
[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015). URL https://www.tensorflow.org/
24
ACCEPTED MANUSCRIPT
[47] T. Tieleman, G. Hinton, Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning (2012).
CR IP T
[48] N. Liu, J.-m. Kan, Improved deep belief networks and multi-feature fusion for leaf identification, Neurocomputing 216 (2016) 460–467. [49] C. Zhao, S. S. Chan, W.-K. Cham, L. Chu, Plant identification using leaf shapesa pattern counting approach, Pattern Recognition 48 (10) (2015) 3203–3215.
AN US
[50] A. Angelova, S. Zhu, Efficient object detection and segmentation for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 811–818. [51] N. Murray, F. Perronnin, Generalized max pooling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2473–2480.
ED
M
[52] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, Cnn features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 806–813.
AC
CE
PT
[53] M. Simon, E. Rodner, Neural activation constellations: Unsupervised part model discovery with convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1143– 1151.
Youxiang Zhu is currently an undergraduate student at the College of Information Science and Technology, Nanjing Forestry University, China. His research interests include deep learning, computer vision, and data mining.
Weiming Sun is currently an undergraduate student at the College of Information Science and Technology, Nanjing Forestry University, China. His research interests include deep learning and computer vision.
Xiangying Cao is currently an undergraduate student at the College of Information Science and Technology, Nanjing Forestry University, China. Her research interests include deep learning and natural language processing.
Chunyan Wang was born in 1994. She is currently pursuing the master's degree at the College of Computer Science and Technology, Nanjing Forestry University, Jiangsu, China. Her main research interests include pattern recognition, machine learning, data mining, and image processing.
Dongyang Wu received the master's degree from Nanjing Forestry University, China, in 2011. She is currently a fourth-year Ph.D. student in bioinformatics at Nanjing Forestry University. Her research interests include machine learning and bioinformatics.
Yin Yang received his Ph.D. degree in computer science from the University of Texas at Dallas in 2013. He is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of New Mexico, Albuquerque. His research interests include physics-based animation/simulation and related applications, scientific visualization, and medical imaging analysis.
Ning Ye received his M.S. degree in Test Measurement Technology and Instruments from Nanjing University of Aeronautics and Astronautics, China, in 2006, and completed his Ph.D. degree in Computer Application Technology at Southeast University, China. He is a full-time professor at the School of Information Technology of Nanjing Forestry University, Nanjing, China. His research interests include machine learning, bioinformatics, and data mining.