Scientia Horticulturae 228 (2018) 187–195
Research Paper
A comparative evaluation of combined feature detectors and descriptors in different color spaces for stereo image matching of trees
Ayoub Jafari Malekabadi, Mehdi Khojastehpour⁎, Bagher Emadi
Department of Biosystems Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
⁎ Corresponding author. E-mail address: [email protected] (M. Khojastehpour).
http://dx.doi.org/10.1016/j.scienta.2017.10.030
Received 18 June 2017; Received in revised form 13 October 2017; Accepted 16 October 2017
A R T I C L E  I N F O
Keywords: Tree; Depth map; Image matching; Feature detector; Feature descriptors; Color spaces

A B S T R A C T
Tree canopy geometric characteristics are directly related to tree growth and productivity, and this information has been used to predict yield, fertilizer application in citrus crops, water consumption and biomass. A 3D model and a depth map of the tree are therefore useful, and one method of creating such a model is the stereo vision technique, in which the comparison and performance assessment of different combinations of feature detectors and descriptors is very important. In this study, the performance of 12 combinations of well-known detectors and descriptors was evaluated: BRISK with SURF, BRISK with BRISK, BRISK with FREAK, Harris with SURF, Harris with BRISK, Harris with FREAK, SURF with SURF, SURF with BRISK, SURF with FREAK, MSER with SURF, MSER with BRISK, and MSER with FREAK. The color spaces considered were HSV, H, YCbCr, Y, NTSC and RGB. The performance of each combination was compared in terms of precision and recall values using stereo image pairs of a tree. The largest numbers of keypoints were detected by the MSER and SURF detectors in almost all of the tested spaces. When the precision and recall results were considered, the best combinations used the SURF detector and descriptor, and RGB and Y were the best spaces in which to implement the combinations. The combinations of SURF with SURF, SURF with FREAK, and Harris with SURF were found to be preferable.
1. Introduction

The structural aspects of a canopy are crucial at different levels (individual tree, crop, forest and ecosystem) (Phattaralerphong and Sinoquet, 2005). Most of the research conducted to date relates to forest areas (Lefsky et al., 2002; Parker et al., 2004; Maas et al., 2008; Kushida et al., 2009). In the field of agriculture, however, obtaining three-dimensional (3D) models of trees and plantations opens an immense and novel field of applications. The geometric characterization of trees is both a relevant and a complex task (Sanz-Cortiella et al., 2011a,b). Canopy characteristics supply valuable information for tree management, reducing production costs and public concerns about environmental pollution. Thus, a whole range of key agricultural activities, including pesticide treatments, irrigation, fertilization and crop training, depends largely on the structural and geometric properties of the visible part of trees (Rosell and Sanz, 2012). At present, research groups are investigating a variety of non-destructive techniques for measuring tree canopy structural characteristics such as volume, foliage and leaf area index. This can be achieved by different detection approaches. The use of ultrasonic sensors (Giles et al., 1988; Zaman and Salyani, 2004; Zaman
and Schumann, 2005; Solanelles et al., 2006), as well as digital photographs (Phattaralerphong and Sinoquet, 2005; Leblanc et al., 2005), laser sensors (Naesset, 1997a,b; Aschoff et al., 2004; Van der Zande et al., 2006; Rosell et al., 2009a,b), stereo images (Andersen et al., 2005; Rovira-Mas et al., 2005; Kise and Zhang, 2006), light sensors (Giuliani et al., 2000), high-resolution radar images (Bongers, 2001) and high-resolution X-ray computed tomography (Stuppy et al., 2003) offers innovative solutions to the problem of structural assessment in trees (Rosell and Sanz, 2012; Jafari Malekabadi et al., 2016). For example, Giles et al. (1987, 1988, 1989a,b) discussed the use of ultrasonic sensors to measure canopy volume in peach and apple trees and developed this technique to improve the process of pesticide application. Their measurement system was based on three ultrasonic sensors mounted at different heights on an air-blast orchard sprayer, and the results showed pesticide savings of up to 52% in apples. Despite these studies, little research has been done on tree modeling by machine vision. Vision-based measurement methods are non-destructive and an effective way to determine external plant features (Yeh et al., 2014), and stereo vision is one such method for extracting 3D information from digital images. The most important problem is finding corresponding points and matching them in
this method. The well-known detectors and descriptors that have received the most citations are ORB, SIFT, SURF, BRISK, FREAK, Harris, FAST and MSER (Llorens et al., 2011). It is impossible to obtain the best accuracy as well as the best robustness in minimum computation time, so concessions must be made to select an optimal feature detection method with respect to the task performed. Feature detection methods have been compared in several studies. Peng (2012) compared six feature descriptors: SURF, ORB, BRIEF, BRISK, SIFT and SU-BRISK. A comparative analysis of three binary descriptors (ORB, BRIEF and BRISK) combined with well-known detectors (ORB, MSER, SIFT, SURF, FAST and BRISK) was carried out in terms of the effects of various geometric and photometric transformations (Heinly et al., 2012). In a study on low-level feature extraction algorithms (El-gayar and Soliman, 2013), FAST-SIFT (F-SIFT) feature detection was compared under blur, illumination and scale changes, rotation and affine transformations. In another study, SIFT was compared with traditional photogrammetric feature extraction methods and matching metrics (Lingua et al., 2009) through experimental tests on images acquired by Unmanned Aerial Vehicles (UAV) and Mobile Mapping Technologies (MMT) with geometric distortions. The performance of keypoint descriptors (FREAK vs. SURF vs. BRISK) has also been examined in the context of pedestrian detection (Schaeffer, 2013). Işık and Ozkan (2015) evaluated the performance of seven combinations of well-known detectors and descriptors (SIFT, SURF, MSER, BRISK, FREAK, ORB, BRIEF and FAST). Although many comparisons of feature detection methods have been published, there is no published performance evaluation for stereo image matching of trees. A 3D map of a tree is needed to measure the geometric characteristics used in the following activities:
• Fertilization: A geometric characterization made during the productive cycle of trees provides an important part of the information required for programming fertilization according to the Nutrient Budgets method (Muhammad et al., 2009; Saa et al., 2013).
• Crop training: The tree canopy is important for providing proper training and pruning. The amount of light intercepted by a tree depends on tree density, orientation, size, shape and leaf area index (Buba, 2015; Stephan et al., 2008).
• Pest and disease control: The geometric characterization of trees provides the fundamental data needed to minimize the environmental impact of pesticides (Russell, 2004).
• Irrigation: Irrigation studies in tree crops are limited by the absence of proper tools for the geometric characterization of vegetation, so researchers use surrogate variables to represent the size and structure of vegetation (Rosell and Sanz, 2012). A precise geometrical characterization of crops at any point during the production cycle may help to establish precise estimations of crop water needs.
In this study, well-known detectors and descriptors were compared in different color spaces: HSV, H, YCbCr, Y, NTSC and RGB. Twelve combinations of detectors and descriptors were evaluated: BRISK with SURF (B-S), BRISK with BRISK (B-B), BRISK with FREAK (B-F), Harris with SURF (H-S), Harris with BRISK (H-B), Harris with FREAK (H-F), SURF with SURF (S-S), SURF with BRISK (S-B), SURF with FREAK (S-F), MSER with SURF (M-S), MSER with BRISK (M-B) and MSER with FREAK (M-F). For the evaluation, precision and recall metrics were used, relating the number of correct matches to the number of keypoints in the reference image and to the number of reference-image keypoints that were matched.
2. Materials and methods

2.1. Stereo vision system and image acquisition

To calculate tree volume (immersion method) and to verify the results of the stereo vision system, an artificial (cherry) tree was made with dimensions of 50 × 70 cm in a conical shape. The tree was positioned in an imaging room equipped with a controlled lighting system and the capability of installing cameras at various heights and angles. The image pairs were acquired using a camera (Nikon COOLPIX P510) for both the right and left sides, with the camera set to manual mode according to Table 1. The camera was manually placed at two viewpoints: at a fixed height (along y), with a 50 cm baseline (along x), at a distance of 100 cm from the tree (along z) and with parallel optical axes.

Table 1
Properties of images.

Property              Value
Digital zoom          0
Flash mode            No flash
35 mm focal length    24
Focal length (mm)     4
F-stop                f/7.4
Dimensions (pixels)   2447 × 3264
Resolution (dpi)      300
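Since the optical axes are parallel, depth relates to horizontal disparity by simple triangulation. As an illustrative check only (the formula is standard and not quoted from the paper; the focal length of roughly 2335 pixels is taken from the calibration results in Table 3), a point at the 100 cm working distance with the 50 cm baseline should show a disparity of about

\[
Z = \frac{fB}{d} \quad\Rightarrow\quad d = \frac{fB}{Z} \approx \frac{2335 \times 0.5\ \text{m}}{1.0\ \text{m}} \approx 1168\ \text{pixels},
\]

where Z is the depth, f the focal length in pixels, B the baseline and d the disparity.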
2.2. Camera calibration

Calibration is an important preliminary step for correctly matching the stereo images and precisely computing depth in the stereo vision system. It is the process of estimating the intrinsic and extrinsic parameters of the camera so as to minimize the discrepancy between the observed image features and their theoretical positions in the pinhole camera model. Camera calibration in the present study was done based on Heikkila and Silven (1997) and the MATLAB camera calibration toolbox (Bouguet, 2004).
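The study used Bouguet's toolbox; as a minimal sketch only, an equivalent workflow with MATLAB's Computer Vision Toolbox functions could look as follows (the folder names are hypothetical; the 30 mm square size matches Section 3.1):

    % Sketch of stereo calibration from checkerboard views.
    % Assumption: the 14 views per camera sit in the two folders below.
    imdsL = imageDatastore('calib/left');                   % hypothetical path
    imdsR = imageDatastore('calib/right');                  % hypothetical path
    [imagePoints, boardSize] = detectCheckerboardPoints(imdsL.Files, imdsR.Files);
    worldPoints = generateCheckerboardPoints(boardSize, 30);    % 3 cm squares, in mm
    stereoParams = estimateCameraParameters(imagePoints, worldPoints, ...
        'WorldUnits', 'mm');
    showReprojectionErrors(stereoParams);    % Section 3.1 reports errors below 0.2 px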
2.3. Un-distortion
In theory, it is possible to define a lens that introduces no distortions. In practice, however, no lens is perfect, mainly for reasons of manufacturing: it is much easier to make a spherical lens than a mathematically ideal parabolic one, and it is also difficult to mechanically align the lens and imager exactly. There are two main lens distortions: radial distortions arise from the shape of the lens, whereas tangential distortions arise from the assembly process of the camera as a whole (Bradski and Kaehler, 2008). After obtaining the distortion parameters and the internal camera calibration, the images were undistorted.
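For reference, the two radial and two tangential coefficients reported in Table 3 correspond to a standard Brown-Conrady style model of the kind used by Heikkila and Silven (1997); stated here for the reader rather than quoted from the paper, with normalized image coordinates (x, y) and r² = x² + y²:

\[
x_d = x\,(1 + k_1 r^2 + k_2 r^4) + 2p_1 xy + p_2 (r^2 + 2x^2),
\]
\[
y_d = y\,(1 + k_1 r^2 + k_2 r^4) + p_1 (r^2 + 2y^2) + 2p_2 xy.
\]

Un-distortion numerically inverts this mapping using the calibrated k₁, k₂, p₁ and p₂.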
2.4. Rectification

For a given point in one image, its corresponding point has to be searched for along an epipolar line in the other image (Zhang, 1998). Generally, epipolar lines are neither aligned with the coordinate axes nor parallel, and such searches are time consuming because pixels must be compared along skew lines in image space. Matching algorithms can be simplified and made more efficient if the epipolar lines are axis-aligned and parallel, so that the epipolar lines of the original images map to horizontally aligned lines in the transformed images. This can be realized by applying a 2D projective transform to each image, a process known as image rectification (Loop and Zhang, 1999). Calibrating the stereo system leads to a simpler technique for rectification.
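Carrying over the stereoParams object from the calibration sketch in Section 2.2 (an assumption, as are the file names), un-distortion and rectification reduce to single toolbox calls:

    % Un-distortion (Section 2.3) and rectification (Section 2.4).
    I1 = imread('left.jpg');    % hypothetical file names
    I2 = imread('right.jpg');
    J1 = undistortImage(I1, stereoParams.CameraParameters1);    % per-camera un-distortion
    J2 = undistortImage(I2, stereoParams.CameraParameters2);
    % rectifyStereoImages removes lens distortion itself and row-aligns
    % the epipolar lines of the two views.
    [R1, R2] = rectifyStereoImages(I1, I2, stereoParams);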
2.5. Object and background segmentation

To eliminate extraneous segments and to increase the speed of the image processing operations, the object (the artificial tree) was segmented from the original image; the dimensions of the object image were 1500 × 950 pixels. Segmentation of the plant area is commonly performed using various RGB indices, which do not always provide satisfactory results under varying illumination (Tian and Slaughter, 1998; Thorp and Tian, 2004). Incorrect segmentation may lead to omission of parts of the object or inclusion of the background as part of the model and, more importantly, to potential inconsistencies in the segmented plant-related parts between the image pairs, which may affect the modeling. Many methods have been proposed to remove the background (Zang and Klette, 2004; Li et al., 2004). In this study, three masks were obtained in the RGB, HSV and YCbCr spaces using the Color Thresholder app in MATLAB; the background was removed and the images were segmented by applying the masks to the object image.

2.6. Color spaces and components

Color spaces differ in terms of color, brightness, intensity and their composition, and machine vision algorithms differ in how they depend on these characteristics. In another study, we compared the RGB, G, HSV, H, YCbCr, Y, NTSC, Lab and a spaces to evaluate feature point detection algorithms. The results showed that the algorithms performed best in the HSV, H, YCbCr and NTSC spaces and were stable in the RGB and Y spaces in terms of the number of detected feature points; these spaces were therefore used in the present study.
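As an illustration of the conversions and of mask-based background removal (the threshold values below are placeholders, not those produced with the Color Thresholder app):

    % Color space conversions used in the study, plus a simple HSV mask.
    rgbIm  = imread('left_object.png');     % hypothetical file name
    hsvIm  = rgb2hsv(rgbIm);                % HSV; channel 1 is the H component
    ycc    = rgb2ycbcr(rgbIm);              % YCbCr; channel 1 is the Y component
    ntscIm = rgb2ntsc(rgbIm);               % NTSC (YIQ)
    mask = hsvIm(:,:,1) > 0.15 & hsvIm(:,:,1) < 0.45 ...    % placeholder hue band
         & hsvIm(:,:,2) > 0.25;                             % placeholder saturation floor
    mask = imfill(mask, 'holes');           % close gaps inside the canopy
    segIm = rgbIm;
    segIm(repmat(~mask, [1 1 3])) = 0;      % zero out the background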
2.7. Overview of feature detection methods

a) SURF. Because of the large amount of data in a pattern recognition task (e.g., face recognition) and the long running time of SIFT (scale-invariant feature transform), Bay et al. (2008) proposed the SURF (speeded up robust features) detector, inspired by the SIFT descriptor. It generates scale- and rotation-invariant interest points and descriptors. SURF has been used as a feature selector in many studies: the descriptors it generates are invariant to rotation and scaling changes, and its computational time for interest point localization and matching is small compared with other feature extraction algorithms. Systematically, SURF uses 2D Haar wavelets and integral images. For keypoint detection, it uses the sum of the 2D Haar wavelet responses around the point of interest, with an integer approximation to the determinant of the Hessian matrix extracting blob-like structures at locations where the determinant is maximal; the performance of SURF can thus be attributed to non-maximum suppression of the determinants of the Hessian matrices. In the description phase, the neighborhood of each keypoint is first divided into a 4 × 4 grid of sub-regions, and the 2D Haar wavelet response of each sub-region is computed, again with the aid of the integral image. Each sub-region contributes four values, so each keypoint is described by a 64-dimensional (4 × 4 × 4) feature vector over all sub-regions.

b) BRISK. The local features obtained by vector-based descriptors such as SURF, SIFT and similar methods represent an image successfully while being invariant to many transformations, such as scale, rotation and viewpoint changes. However, such descriptors are not efficient for machines with scarce resources, such as mobile wireless devices with limited computation power, limited uplink bandwidth and low power requirements. To address this challenge, several binary descriptors are computed directly on image patches; BRISK (binary robust invariant scalable keypoints) is one of them. BRISK is based on the FAST (features from accelerated segment test) detector. In general, BRISK consists of three parts: a sampling pattern, orientation compensation and sampling pairs. The sampling pattern taken around the keypoint refers to points spread on a set of concentric circles, which are used in the FAST detector to determine whether a point is a corner. The sampled pairs are separated into two subsets, short-distance pairs and long-distance pairs. To achieve rotation invariance, the direction of each keypoint is determined by summing the local gradients computed over the long-distance pairs, and the short-distance pairs are rotated according to the obtained orientation. Finally, for each pair the intensity values of the first and second points are compared: if the value of the first point is larger than that of the second, the output is 1, and otherwise 0. Going through all 512 pairs hence yields a descriptor 512 bits in length. For matching, the Hamming distance is used instead of the Euclidean distance because of its short execution time (Leutenegger et al., 2011).

c) MSER. An MSER region is a set of connected pixels above a threshold that remains virtually unchanged over a range of thresholds; in other words, the selected regions are shapes whose local binarization is stable over a large range of thresholds. MSER detection is similar to a watershedding process: an intensity threshold is selected, dividing the pixels into two groups, black and white, and the cardinality of the two sets changes as the threshold sweeps from maximum to minimum intensity. The area of each connected component is stored as a function of the threshold. Among the extremal regions, the "maximally stable" ones are chosen by analyzing this function for each candidate region to find those that keep a similar function value over multiple thresholds. The selected regions are called MSER regions; their size changes only a little across at least several intensity threshold levels (Matas et al., 2004; Nistér and Stewénius, 2008; Obdržalek et al., 2010).

d) FREAK. FREAK is also a binary descriptor and borrows the sampling pattern and pair selection procedures from BRISK. It uses a circular pattern in which the density of points drops exponentially away from the center, called a retinal sampling grid because it is inspired by the retinal pattern of the eye. To provide rotation invariance, an orientation for the selected patch is computed by summing the local gradients over chosen pairs that are symmetric about the center. In the descriptor creation stage, an approach similar to that of ORB is applied: simply, the least correlated pattern is selected (Alahi et al., 2012).

e) Harris. The Harris or Harris-Stephens corner detector family provides improvements over the Moravec method. The goal of the Harris method is to find the directions of fastest and lowest change for feature orientation, using a covariance matrix of local directional derivatives. The directional derivative values are compared with a scoring factor to identify which features are corners, which are edges, and which are likely noise. Depending on the formulation of the algorithm, the Harris method can provide high rotational invariance, limited intensity invariance and, in some formulations such as the Harris-Laplace method using scale space, scale invariance (Krig, 2014).
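A minimal MATLAB sketch of one of the twelve detector-descriptor combinations (S-S), using the parameter values of Table 2; swapping the detect*Features call or the 'Method' argument (e.g., detectMSERFeatures, detectBRISKFeatures, detectHarrisFeatures; 'BRISK', 'FREAK') yields the other combinations. The segmented images segL and segR are assumed from Section 2.5:

    % SURF detector + SURF descriptor on a rectified, segmented stereo pair.
    grayL = rgb2gray(segL);  grayR = rgb2gray(segR);    % segL/segR assumed
    ptsL = detectSURFFeatures(grayL, 'MetricThreshold', 8000);   % Table 2
    ptsR = detectSURFFeatures(grayR, 'MetricThreshold', 8000);
    [featL, validL] = extractFeatures(grayL, ptsL, 'Method', 'SURF');
    [featR, validR] = extractFeatures(grayR, ptsR, 'Method', 'SURF');
    pairs = matchFeatures(featL, featR, 'MaxRatio', 0.8);        % Table 2
    matchedL = validL(pairs(:, 1));
    matchedR = validR(pairs(:, 2));
    figure; showMatchedFeatures(grayL, grayR, matchedL, matchedR, 'montage');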
2.8. Evaluation metrics

Typically, several implementations of a detector or descriptor exist. The original implementations by the authors usually perform better than third-party implementations, and hence the original implementations were used whenever available. In addition, the number of features influences the retrieval performance: in general, the more features are detected, the better the results the retrieval system produces.
Table 2
Selected parameters for the detectors.

Detector       Parameter         Value
SURF           MetricThreshold   8000
MSER           ThresholdDelta    1
MSER           RegionAreaRange   [5, 50000]
MSER           MaxAreaVariation  0.4
Harris         MinQuality        0.25
BRISK          MinQuality        0.8
BRISK          MinContrast       0.01
matchFeatures  MaxRatio          0.8
In our experiments, feature extraction speed was not considered. Table 2 shows the noteworthy parameters for each detector; all descriptors were run with default parameters. To evaluate the performance of each method, the precision and recall values of each image in the dataset were computed based on the metrics introduced by Mikolajczyk and Schmid (2005); precision and recall were found to be sufficient for a mid-level comparison. For a classification task, precision is the number of true positives (i.e., the number of items correctly assigned to the positive class) divided by the total number of items predicted as positive (i.e., true positives plus false positives), while recall is the number of true positives divided by the total number of items actually labelled positive (true positives plus false negatives). Here, recall represents the ratio of correctly matched descriptors to the number of correspondences between the two images; a higher recall indicates a more sensitive, better performing feature detection method. The recall value is obtained with formula (1). Precision refers to the ratio of correctly matched descriptors to the total number of positively matched descriptors; a higher precision means the matched features are more relevant to each other. The precision value is computed with formula (2).

Recall = correct matches / correspondences    (1)

Precision = correct matches / total matches    (2)

The values of precision and recall vary with the strictness of the matching criteria and the complexity of the data; therefore, the matching criteria were kept as balanced as possible, as mentioned above.
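Following formulas (1) and (2), and consistent with Tables 4–9 (where recall = C.M/L.P and precision = C.M/M.P), the metrics can be computed from a match set as below; the row-distance test for deciding whether a match is correct is an illustrative assumption for rectified pairs, not the paper's stated criterion:

    % Precision and recall for one detector-descriptor combination.
    % matchedL/matchedR, pairs and ptsL come from the sketch in Section 2.7.
    rowGap = abs(matchedL.Location(:, 2) - matchedR.Location(:, 2));
    correct = rowGap < 1;                        % assumed tolerance of 1 px
    numCorrect = nnz(correct);                   % C.M in Tables 4-9
    recall    = numCorrect / ptsL.Count;         % formula (1): C.M / L.P
    precision = numCorrect / size(pairs, 1);     % formula (2): C.M / M.P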
3. Results and discussion

3.1. Camera calibration, image un-distortion and rectification

The implementation was done in MATLAB. For calibration, 14 images of a planar checkerboard were captured with each camera; the checkerboard dimensions were 21 × 29 cm and each square measured 3 cm. Table 3 shows the calibration results for each camera and for the stereo system. The calibration error was less than 0.2 pixel, which is suitable for implementing a stereo vision system. After camera calibration and parameter calculation, the images were un-distorted and rectified; images 2 and 3 in Fig. 1 show the original stereo images after un-distortion and rectification, respectively.
3.2. Object and background segmentation

To eliminate extraneous segments and to increase the speed of the image processing operations, the object (the artificial tree) was segmented from the original image; image 4 in Fig. 1 shows the stereo images after object segmentation. Masks were then applied to the object images to remove the background; image 5 in Fig. 1 shows the stereo images after background segmentation.

3.3. Evaluation of performance

To analyze performance, the algorithms were run in the different color spaces. We emphasize that no auxiliary method, such as the RANSAC algorithm, was employed to eliminate inconsistent matches by selecting inliers and rejecting outliers. To discuss the experimental results, the number of features detected in the left and right images (L.P and R.P), the number of features extracted (valid points) by the descriptors in the left and right images (L.V.P and R.V.P), the matched points (M.P), the correctly matched points (C.M), and the precision and recall are presented in Tables 4–9 and Fig. 3; examples of correct and incorrect matches are displayed in Fig. 2.
Table 3
Calibration results for the right and left cameras and the stereo vision system.

System      Parameter              Left camera (pixel)                  Right camera (pixel)
One camera  Focal length X         2335.099 ± 8.548                     2323.390 ± 7.310
            Focal length Y         2334.168 ± 8.243                     2330.707 ± 7.030
            Principal point X      1699.495 ± 7.885                     1651.236 ± 7.127
            Principal point Y      1271.349 ± 9.180                     1278.623 ± 7.771
            Skew                   0                                    0
            Radial distortion      −0.0254 ± 0.0045, −0.0008 ± 0.0043   0.0004 ± 0.0039, −0.0170 ± 0.0035
            Tangential distortion  0.0075 ± 0.0012, 0.0098 ± 0.0009     0.0032 ± 0.0010, 0.0060 ± 0.0008
            Error X                0.163                                0.193
            Error Y                0.163                                0.176
Stereo      Focal length X         2929.857 ± 131.234                   2729.732 ± 123.600
            Focal length Y         2942.340 ± 131.092                   2698.808 ± 118.654
            Principal point X      1559.780 ± 132.018                   1772.300 ± 87.986
            Principal point Y      1043.356 ± 120.004                   1189.973 ± 94.267
            Skew                   0                                    0
            Radial distortion      0.0026 ± 0.0537, −0.0335 ± 0.0694    −0.0309 ± 0.0508, 0.0541 ± 0.0543
            Tangential distortion  0.0009 ± 0.0115, 0.0130 ± 0.0110     0.0151 ± 0.0055, 0.0270 ± 0.0103
            Rotation vector        [0.0440 −0.0284 0.0059] ± [0.0199 0.0392 0.0081]
            Translation vector     [−589.197 −6.907 −160.162] ± [21.351 13.693 55.315]
Fig. 1. 1) Original Images 2) Un-distorted Images 3) Rectified Images 4) Object segmentation 5) Background segmentation.
First, the performance of the methods in the various spaces is discussed; next, the performance of the methods is compared across spaces.

3.3.1. Performance in color spaces

a) RGB space. The performance results of the various combinations in RGB space are given in Table 4. The largest number of keypoints was detected by the MSER detector (similar to the results obtained by Mikolajczyk and Schmid, 2005, and Işık and Ozkan, 2015), followed by the SURF detector (similar to the results obtained by Canclini et al., 2013). Although the various combinations extracted many valid points, many of these points were matched incorrectly, so precision and recall were not calculated for those combinations; they were calculated for the S-S, S-F, M-S, M-B and H-S combinations. The SURF descriptor had the best performance in this space. The M-B combination showed a greater precision, but its recall was low, so it is not a suitable method. It is important to note that "precision" and "recall" must be used together to indicate the true overall performance of any method: a method may contain a high percentage of correct matches (very high precision) that are still a very small percentage of the actual matches in a region, while another method might contain all the possible matches, giving a very high recall, although only a fraction of the matches it contains are correct. Therefore, the best result was obtained by S-S, which produced the largest recall and the largest precision after M-B and S-F; thus S-S and S-F had the best performance in RGB space.

b) HSV space. Table 5 shows the performance results of the various combinations in HSV space. As in RGB, the largest number of keypoints was detected by the MSER detector (similar to the results obtained by Mikolajczyk and Schmid, 2005, and Işık and Ozkan, 2015). Although the various combinations extracted many valid points, many of them were matched incorrectly, and precision and recall were only calculated for the M-S and M-F combinations. The combinations did not perform well in this space, though the best result was obtained by M-F, which produced the largest recall and precision.
Table 4
Comparison of different combinations in RGB space.

Combination   L.P    R.P    L.V.P   R.V.P   M.P   C.M   Recall   Precision
S-S           902    950    902     950     194   11    0.0122   0.057
S-B           902    950    844     898     0     –     –        –
S-F           902    950    885     939     78    5     0.0056   0.064
M-S           960    1757   960     1757    106   2     0.0021   0.019
M-B           960    1757   960     1757    1     1     0.0010   1.000
M-F           960    1757   958     1757    59    0     0        0
B-S           521    555    521     555     94    0     0        0
B-B           521    555    521     554     1     0     0        0
B-F           521    555    521     554     16    0     0        0
H-S           228    468    228     468     55    1     0.0044   0.018
H-B           228    468    227     468     1     0     0        0
H-F           228    468    227     468     11    0     0        0

Note: S-S: SURF with SURF, S-B: SURF with BRISK, S-F: SURF with FREAK, M-S: MSER with SURF, M-B: MSER with BRISK, M-F: MSER with FREAK, B-S: BRISK with SURF, B-B: BRISK with BRISK, B-F: BRISK with FREAK, H-S: Harris with SURF, H-B: Harris with BRISK, H-F: Harris with FREAK. L.P and R.P: left and right points; L.V.P and R.V.P: left and right valid points; M.P: matched points; C.M: correct matches.

Table 5
Comparison of different combinations in HSV space.

Combination   L.P    R.P    L.V.P   R.V.P   M.P   C.M   Recall   Precision
S-S           23     171    23      171     2     0     0        0
S-B           23     171    20      165     0     –     –        –
S-F           23     171    22      169     1     0     0        0
M-S           4538   7516   4538    7516    193   2     0.0004   0.010
M-B           4538   7516   4533    7500    1     0     0        0
M-F           4538   7516   4531    7499    160   3     0.0007   0.019
B-S           365    401    365     401     4     0     0        0
B-B           365    401    364     400     1     0     0        0
B-F           365    401    364     400     6     0     0        0
H-S           32     97     32      97      0     –     –        –
H-B           32     97     32      97      0     –     –        –
H-F           32     97     32      97      1     0     0        0

Note: abbreviations as in Table 4.

Table 6
Comparison of different combinations in H space.

Combination   L.P    R.P    L.V.P   R.V.P   M.P   C.M   Recall   Precision
S-S           603    644    603     644     45    2     0.0033   0.044
S-B           603    644    593     615     0     –     –        –
S-F           603    644    599     633     36    1     0.0017   0.028
M-S           4890   6877   4890    6877    322   6     0.0012   0.019
M-B           4890   6877   4869    6858    2     2     0.0004   1.000
M-F           4890   6877   4873    6857    125   0     0        0
B-S           626    337    626     337     14    0     0        0
B-B           626    337    621     337     0     –     –        –
B-F           626    337    622     337     1     0     0        0
H-S           287    171    287     171     14    0     0        0
H-B           287    171    287     170     0     –     –        –
H-F           287    171    287     170     6     0     0        0

Note: abbreviations as in Table 4.
c) H space. The performance results of the various combinations in H space are listed in Table 6. The largest number of keypoints was detected by the MSER detector, similar to the results in the RGB and HSV spaces. Although the various combinations extracted many valid points, many points were matched incorrectly, so precision and recall were calculated only for the S-S, S-F, M-S and M-B combinations. The M-B combination showed a greater precision, but its recall was low, so, as mentioned above, it is not a suitable method. The best result was obtained by S-S, which produced the largest recall and precision.

d) NTSC space. Table 7 shows the performance results of the various combinations in NTSC space. As in the RGB, HSV and H spaces, the largest number of keypoints was detected by the MSER detector. The various combinations extracted few valid points and many of these points were matched incorrectly; precision and recall were calculated only for the H-S combination. The combinations did not perform well in this space.
Table 7
Comparison of different combinations in NTSC space.

Combination   L.P   R.P   L.V.P   R.V.P   M.P   C.M   Recall   Precision
S-S           2     6     2       6       0     –     –        –
S-B           2     6     2       6       0     –     –        –
S-F           2     6     2       6       0     –     –        –
M-S           184   188   184     188     29    0     0        0
M-B           184   188   184     188     0     –     –        –
M-F           184   188   184     188     9     0     0        0
B-S           166   187   166     187     34    0     0        0
B-B           166   187   164     186     0     –     –        –
B-F           166   187   165     186     15    0     0        0
H-S           103   502   103     502     21    1     0.0097   0.048
H-B           103   502   103     502     0     –     –        –
H-F           103   502   103     502     2     0     0        0

Note: abbreviations as in Table 4.

Table 8
Comparison of different combinations in YCbCr space.

Combination   L.P   R.P   L.V.P   R.V.P   M.P   C.M   Recall   Precision
S-S           0     0     –       –       –     –     –        –
S-B           0     0     –       –       –     –     –        –
S-F           0     0     –       –       –     –     –        –
M-S           161   149   161     149     22    0     0        0
M-B           161   149   161     149     0     –     –        –
M-F           161   149   161     149     18    1     0.0062   0.056
B-S           231   236   231     236     52    0     0        0
B-B           231   236   228     235     1     0     0        0
B-F           231   236   228     235     9     0     0        0
H-S           174   366   174     366     51    0     0        0
H-B           174   366   173     366     0     –     –        –
H-F           174   366   173     366     9     0     0        0

Note: abbreviations as in Table 4.

Table 9
Comparison of different combinations in Y space.

Combination   L.P    R.P    L.V.P   R.V.P   M.P   C.M   Recall   Precision
S-S           646    699    646     699     152   8     0.0124   0.053
S-B           646    699    605     659     0     –     –        –
S-F           646    699    635     690     38    2     0.0031   0.053
M-S           1583   1799   1583    1799    258   3     0.0019   0.012
M-B           1583   1799   1580    1796    0     –     –        –
M-F           1583   1799   1581    1797    104   2     0.0013   0.019
B-S           186    222    186     222     29    0     0        0
B-B           186    222    185     220     0     –     –        –
B-F           186    222    185     220     3     0     0        0
H-S           243    401    243     401     71    0     0        0
H-B           243    401    243     401     0     –     –        –
H-F           243    401    243     401     14    1     0.0041   0.071

Note: abbreviations as in Table 4.
e) YCbCr space. The performance results of the various combinations in YCbCr space are listed in Table 8. The largest number of keypoints was detected by the BRISK detector (similar to the results obtained by Canclini et al., 2013); the SURF detector did not detect any keypoints in this space. The various combinations extracted few valid points and many of these points were matched incorrectly; precision and recall were calculated only for the M-F combination. The combinations did not perform well in this space.

f) Y space. Table 9 shows the performance results of the various combinations in Y space. The largest number of keypoints was detected by the MSER detector, followed by the SURF detector; the performance of the MSER detector was similar to the results in the RGB, NTSC, HSV and H spaces. Although the various combinations extracted many valid points, many of these points were matched incorrectly, so precision and recall were not calculated for those combinations; they were calculated for the S-S, S-F, M-S, M-F and H-F combinations. The FREAK and SURF descriptors revealed the best performance in this space. The H-F combination showed a greater precision, but its recall was low. In contrast, S-S produced the largest recall, and its precision was the largest after H-F; thus S-S had the best performance, and S-F and H-F also performed suitably in Y space.
Fig. 2. The correct and incorrect matched points.
Fig. 3. Precision and recall for combinations.
3.3.2. Performance of combinations together

As mentioned above, precision and recall were calculated for the S-S, S-F, M-S, M-B and H-S combinations in RGB space, the M-S and M-F combinations in HSV space, the S-S, S-F, M-S and M-B combinations in H space, the H-S combination in NTSC space, the M-F combination in YCbCr space and the S-S, S-F, M-S, M-F and H-F combinations in Y space. The performances of these combinations were therefore compared together, and the ten combinations with the greatest recall were selected. Fig. 3 shows the performance comparison of these ten combinations in terms of precision and recall. Of the ten, four combinations were in RGB space and three in Y space, and the first and second scores for precision and recall were in the Y and RGB spaces, respectively. Out of the ten methods, 5, 3 and 2 combinations used the SURF, Harris and MSER detectors, and 6 and 4 combinations used the SURF and FREAK descriptors, respectively. Some results were similar to those obtained by Mikolajczyk and Schmid (2005), Işık and Ozkan (2015) and Canclini et al. (2013). Therefore, RGB and Y are the best spaces in which to implement the combinations; S-S has the best performance, and S-F and H-S also perform suitably.

4. Conclusions

In this study, the performance of twelve combinations of well-known detectors and descriptors on stereo image pairs of an artificial tree was compared and evaluated. The evaluation was based on precision and recall values, and the tested color spaces were HSV, H, YCbCr, Y, NTSC and RGB. From the obtained results, it can be concluded that:
• The largest number of keypoints was detected by the MSER and SURF detectors in almost all of the tested spaces.
• The best combinations used the SURF detector and descriptor when the precision and recall results were considered.
• Comparing the various combinations in the different spaces, RGB and Y were the best spaces in which to implement the combinations.
• The best results were obtained when the SURF detector was combined with the SURF descriptor; SURF-FREAK and Harris-SURF also showed suitable performance.
• As future work, the RANSAC algorithm can be employed before the matching algorithm to eliminate inconsistent matches and obtain more correct matches; a minimal sketch of this step follows.
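A hedged illustration of that future-work step (not part of the study itself), using MATLAB's RANSAC-based fundamental matrix estimation on the matched points from Section 2.7; the trial count and distance threshold are illustrative assumptions:

    % Reject inconsistent matches with RANSAC via epipolar geometry.
    [fMatrix, inlierIdx] = estimateFundamentalMatrix( ...
        matchedL, matchedR, 'Method', 'RANSAC', ...
        'NumTrials', 2000, 'DistanceThreshold', 0.01);   % illustrative settings
    inlierL = matchedL(inlierIdx);    % consistent matches kept
    inlierR = matchedR(inlierIdx);
    figure; showMatchedFeatures(grayL, grayR, inlierL, inlierR, 'montage');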
Acknowledgement

The authors gratefully acknowledge the financial support provided by Ferdowsi University of Mashhad (Grant No. 31500).
References

Alahi, A., Ortiz, R., Vandergheynst, P., 2012. FREAK: fast retina keypoint. In: Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 510–517.
Andersen, H., Reng, L., Kirk, K., 2005. Geometric plant properties by relaxed stereo vision using simulated annealing. Comput. Electron. Agric. 49, 219–232.
Aschoff, T., Thies, M., Spiecker, H., 2004. Describing forest stands using terrestrial laser-scanning. In: ISPRS International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXV, Part B, Istanbul, Turkey, 12–23 July 2004, pp. 237–241.
Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008. Speeded-up robust features (SURF). Comput. Vision Image Understanding 110 (3), 346–359.
Bongers, F., 2001. Methods to assess tropical rain forest canopy structure: an overview. Plant Ecology 153, 263–277.
Bouguet, J.Y., 2004. Camera Calibration Toolbox for Matlab. Computational Vision at the California Institute of Technology.
Bradski, G., Kaehler, A., 2008. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, USA.
Buba, T., 2015. Impacts of different tree species of different sizes on spatial distribution of herbaceous plants in the Nigerian guinea savannah ecological zone. Scientifica 2015, 106930.
Canclini, A., Cesana, M., Redondi, A., Tagliasacchi, M., Ascenso, J., Cilla, R., 2013. Evaluation of low-complexity visual feature detectors and descriptors. In: 18th International Conference on Digital Signal Processing (DSP), 1–3 July, pp. 1–7.
El-gayar, M., Soliman, H., 2013. A comparative study of image low level feature extraction algorithms. Egypt. Inform. J. 14 (2), 175–181.
Giles, D.K., Delwiche, M.J., Dodd, R.B., 1987. Control of orchard spraying based on electronic sensing of target characteristics. Trans. ASAE 30 (1636), 1624–1630.
Giles, D.K., Delwiche, M.J., Dodd, R.B., 1988. Electronic measurement of tree canopy volume. Trans. ASAE 31, 264–272.
Giles, D.K., Delwiche, M.J., Dodd, R.B., 1989a. Method and Apparatus for Target Plant Foliage Sensing and Mapping and Related Materials Application Control. U.S. Patent 4,823,268.
Giles, D.K., Delwiche, M.J., Dodd, R.B., 1989b. Sprayer control by sensing orchard crop characteristics: orchard architecture and spray liquid savings. J. Agric. Eng. Res. 43, 271–289.
Giuliani, R., Magnanini, E., Fragassa, C., Nerozzi, F., 2000. Ground monitoring the light-shadow windows of a tree canopy to yield canopy light interception and morphological traits. Plant Cell Environ. 23, 783–796.
Heikkila, J., Silven, O., 1997. A four-step camera calibration procedure with implicit image correction. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1106–1112.
Heinly, J., Dunn, E., Frahm, J.-M., 2012. Comparative evaluation of binary features. In: Computer Vision–ECCV. Springer, pp. 759–773.
Işık, S., Ozkan, K., 2015. A comparative evaluation of well-known feature detectors and descriptors. International Journal of Applied Mathematics, Electronics and Computers 3 (1), 1–6.
Jafari Malekabadi, A., Khojastehpour, M., Emadi, B., 2016. Comparing measurement methods of the geometric characterization of trees. In: The 10th National Congress on Biosystems Engineering (Agricultural Machinery) and Mechanization of Iran, 30–31 August, Ferdowsi University of Mashhad, Mashhad, Iran.
Kise, M., Zhang, Q., 2006. Reconstruction of a virtual 3D field scene from ground-based multi-spectral stereo imaging. In: Proceedings of the 2006 ASABE Annual International Meeting, Portland, Oregon. Paper Number 063098.
Krig, S., 2014. Computer Vision Metrics: Survey, Taxonomy, and Analysis. Apress, Berkeley, CA, USA.
Kushida, K., Yoshino, K., Nagano, T., Ishida, T., 2009. Automated 3D forest surface model extraction from balloon stereo photographs. Photogramm. Eng. Remote Sens. 75 (1), 25–35.
Leblanc, S.G., Chen, J.M., Fernandes, R., Deering, D.W., Conley, A., 2005. Methodology comparison for canopy structure parameters extraction from digital hemispherical photography in boreal forest. Agric. Forest Meteorol. 129, 187–207.
Lefsky, M.A., Cohen, W.B., Parker, G.G., Harding, D.J., 2002. Lidar remote sensing for ecosystem studies. Bioscience 52 (1), 19–30.
Leutenegger, S., Chli, M., Siegwart, R.Y., 2011. BRISK: binary robust invariant scalable keypoints. In: Computer Vision (ICCV), IEEE, pp. 2548–2555.
Li, L., Huang, W., Yu-Hua Gu, I., Tian, Q., 2004. Statistical modeling of complex backgrounds for foreground object detection. IEEE Trans. Image Process. 13 (11), 1459–1472.
Lingua, A., Marenchino, D., Nex, F., 2009. Performance analysis of the SIFT operator for automatic feature extraction and matching in photogrammetric applications. Sensors 9 (5), 3745–3766.
Llorens, J., Gil, E., Llop, J., Escolà, A., 2011. Ultrasonic and LIDAR sensors for electronic canopy characterization in vineyards: advances to improve pesticide application methods. Sensors 11 (2), 2177–2194.
Loop, C., Zhang, Z., 1999. Computing rectifying homographies for stereo vision. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1, pp. 125–131.
Maas, H.G., Bienert, A., Scheller, S., Keane, E., 2008. Automatic forest inventory parameter determination from terrestrial laser scanner data. Int. J. Remote Sens. 29 (5), 1579–1593.
Matas, J., Chum, O., Urban, M., Pajdla, T., 2004. Robust wide-baseline stereo from maximally stable extremal regions. Image Vision Comput. 22 (10), 761–767.
Mikolajczyk, K., Schmid, C., 2005. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27 (10), 1615–1630.
Muhammad, S., Luedeling, E., Brown, P.H., 2009. A nutrient budget approach to nutrient management in almond. In: Proceedings of the International Plant Nutrition Colloquium XVI, Department of Plant Sciences, UC Davis.
Naesset, E., 1997a. Estimating timber volume of forest stands using airborne laser scanner data. Remote Sens. Environ. 61, 246–253.
Naesset, E., 1997b. Determination of mean tree height of forest stands using airborne laser scanner data. ISPRS J. Photogramm. Remote Sens. 52, 49–56.
Nistér, D., Stewénius, H., 2008. Linear time maximally stable extremal regions. In: Computer Vision (ECCV 2008). Springer, pp. 183–196.
Obdržalek, D., Basovnik, S., Mach, L., Mikulik, A., 2010. Detecting scene elements using maximally stable colour regions. In: Research and Education in Robotics (EUROBOT 2009). Springer, pp. 107–115.
Parker, G., Harding, D., Berger, M.L., 2004. A portable LIDAR system for rapid determination of forest canopy structure. J. Appl. Ecol. 41 (4), 755–767.
Peng, Z., 2012. Efficient Matching of Robust Features for Embedded SLAM.
Phattaralerphong, J., Sinoquet, H., 2005. A method for 3D reconstruction of tree canopy volume from photographs: assessment from 3D digitised plants. Tree Physiol. 25, 1229–1242.
Rosell, J.R., Sanz, R., 2012. A review of methods and applications of the geometric characterization of tree crops in agricultural activities. Comput. Electron. Agric. 81, 124–141.
Rosell, J.R., Llorens, J., Sanz, R., Arno, J., Ribes-Dasi, M., Masip, J., Escolà, A., Camp, F., Solanelles, F., Gràcia, F., Gil, E., Val, L., Planas, S., Palacin, J., 2009a. Obtaining the three-dimensional structure of tree orchards from remote 2D terrestrial LIDAR scanning. Agric. Forest Meteorol. 149, 1505–1515.
Rosell, J.R., Sanz, R., Llorens, J., Arno, J., Escolà, A., Ribes-Dasi, M., Masip, J., Camp, F., Gràcia, F., Solanelles, F., Pallejà, T., Val, L., Planas, S., Gil, E., Palacin, J., 2009b. A tractor-mounted scanning LIDAR for the non-destructive measurement of vegetative volume and surface area of tree-row plantations: a comparison with conventional destructive measurements. Biosyst. Eng. 102 (2), 128–134.
Rovira-Mas, F., Zhang, Q., Reid, J., 2005. Creation of three-dimensional crop maps based on aerial stereo images. Biosyst. Eng. 90 (3), 251–259.
Russell, P., 2004. Recommended pesticide dose rates: how low can you go? Outlooks Pest Manage. 15 (6), 242–243.
Saa, S., Muhammad, S., Brown, P.H., 2013. Development of leaf sampling and interpretation methods and nutrient budget approach to nutrient management in almond (Prunus dulcis (Mill.) D.A. Webb). ISHS Acta Hortic. 984, 291–296.
Sanz-Cortiella, R., Llorens-Calveras, J., Escolà, A., Arno-Satorra, J., Ribes-Dasi, M., Masip-Vilalta, J., Camp, F., Gràcia-Aguilà, F., Solanelles-Batlle, F., Planas-DeMarti, S., Pallejà-Cabré, T., Palacin-Roca, J., Gregorio-Lopez, E., Del-Moral-Martinez, I., Rosell-Polo, J.R., 2011a. Innovative LIDAR 3D dynamic measurement system to estimate fruit-tree leaf area. Sensors 11 (6), 5769–5791.
Sanz-Cortiella, R., Llorens-Calveras, J., Rosell-Polo, J.R., Gregorio-Lopez, E., Palacin-Roca, J., 2011b. Characterisation of the LMS200 laser beam under the influence of blockage surfaces. Influence on 3D scanning of tree orchards. Sensors 11 (3), 2751–2772.
Schaeffer, C., 2013. A Comparison of Keypoint Descriptors in the Context of Pedestrian Detection: FREAK vs. SURF vs. BRISK. Stanford University CS Department.
Solanelles, F., Escolà, A., Planas, S., Rosell, J.R., Camp, F., Gracia, F., 2006. An electronic control system for pesticide application proportional to the canopy width of tree crops. Biosyst. Eng. 95 (4), 473–481.
Stephan, J., Sinoquet, H., Donès, N., Haddad, N., Talhouk, S., Lauri, P.E., 2008. Light interception and partitioning between shoots in apple cultivars influenced by training. Tree Physiol. 28, 331–342.
Stuppy, W., Maisano, J., Colbert, M., Rudall, P., Rowe, T., 2003. Three-dimensional analysis of plant structure using high-resolution X-ray computed tomography. Trends Plant Sci. 8 (1), 2–6.
Thorp, K.R., Tian, L.F., 2004. A review on remote sensing of weeds in agriculture. Precis. Agric. 5, 477–508.
Tian, L.F., Slaughter, D.C., 1998. Environmentally adaptive segmentation algorithm for outdoor image segmentation. Comput. Electron. Agric. 21, 153–168.
Van der Zande, D., Hoet, W., Jonckheere, I., Aardt, J., Coppin, P., 2006. Influence of measurement set-up of ground-based LIDAR for derivation of tree structure. Agric. Forest Meteorol. 141, 147–160.
Yeh, Y.-H.F., Lai, T.-C., Liu, T.-Y., Liu, C.-C., Chung, W.-C., Lin, T.-T., 2014. An automated growth measurement system for leafy vegetables. Biosyst. Eng. 117, 43–50.
Zaman, Q.U., Salyani, M., 2004. Effects of foliage density and ground speed on ultrasonic measurement of citrus tree volume. Appl. Eng. Agric. 20 (2), 173–178.
Zaman, Q.U., Schumann, A.W., 2005. Performance of an ultrasonic tree volume measurement system in commercial citrus groves. Precis. Agric. 6 (5), 467–480.
Zang, Q., Klette, R., 2004. Robust background subtraction and maintenance. In: Proc. of the 17th Int. Conf. on Pattern Recognition, Vol. 2, pp. 90–93.
Zhang, Z., 1998. Determining the epipolar geometry and its uncertainty: a review. Int. J. Comput. Vision 27 (2), 161–195.