Using three dimensional convolutional neural networks for denoising echosounder point cloud data


David Stephens *, Andrew Smith, Thomas Redfern, Andrew Talbot, Andrew Lessnoff, Kari Dempsey

UK Hydrographic Office, Admiralty Way, Taunton, Somerset, TA1 2DN, United Kingdom

Keywords: 3D convolutional neural network; multibeam echosounder; point cloud; hydrographic survey; deep learning; bathymetry model

Abstract

It is estimated that over 80% of the world's oceans are unexplored and unmapped, limiting our understanding of ocean systems. Due to the data collection rates of modern survey technologies such as swathe multibeam echosounders (MBES) and initiatives such as Seabed 2030, an ever-increasing volume of seafloor data is being collected. These large data volumes present significant challenges around quality assurance and validation, with current approaches often requiring manual input. The aim of this study is to test the efficacy of applying novel 3D Convolutional Neural Network models to the problem of removing noise from MBES point cloud data, with a view to increasing the automation of processing bathymetric data. The results reported from hold-out test sets show promising performance, with a classification accuracy of 97% and kappa scores of 0.94 on voxelized point cloud data. Deploying a sufficiently trained model in a productionized processing pipeline could be transformational, reducing the manual intervention required to take raw MBES point cloud data to a bathymetric data product.

1. Introduction

Bathymetry is the study and measurement of water depth and is fundamental to various domains such as maritime navigation, port management, offshore oil and resource exploration, and habitat mapping for marine conservation. Historically, bathymetry data was collected using a lead (or sounding) line: a rope with a heavy weight attached, lowered from a ship until the weight hit the seabed (Dierssen and Theberge, 2016). Modern surveys typically use multibeam echosounders (MBES) mounted to survey vessels, which collect thousands of depth measurements per second, greatly increasing the rate of acquiring depth data as well as the spatial coverage. However, modern MBES systems present a number of challenges due to the large volume of data collected, which requires significant processing, quality assurance and validation to enable the data to be used with confidence (Calder and Mayer, 2003).

The types of errors and noise in MBES data vary considerably depending on factors such as the type of MBES system used, the sea state, the MBES settings and calibration, and water-column conditions. In addition, noise can increase through the presence of objects in the water column (e.g., fish schools), bubbles from vessel cavitation, the type of seabed, local changes in sound velocity through the water, vessel movements, etc. (Arge et al., 2010). Current techniques used in the processing of MBES point cloud data involve the application of statistical filters and manual inspection and cleaning, which can include point-by-point inspection of soundings. These techniques have a number of drawbacks: long processing times, a lack of reproducibility, parameters that often must be tuned for different areas, substantial manual effort and limited scalability. In particular, when processing data for use in navigational products, automated processing tools often cannot provide consistent levels of accuracy and analysts revert to more semi-manual processing. The costs of developing new scientific products (e.g., charts, maps, elevation models) are therefore sensitive not only to the costs of data collection (physical surveys and equipment purchase) but also to the costs of data handling, processing and analysis, limiting society's ability to generate bathymetric data products in areas lacking a direct economic imperative (e.g., developing nations).

* Corresponding author. E-mail addresses: [email protected] (D. Stephens), [email protected] (A. Smith), [email protected] (T. Redfern), [email protected] (A. Talbot), [email protected] (A. Lessnoff), [email protected] (K. Dempsey).
https://doi.org/10.1016/j.acags.2019.100016
Received 23 May 2019; Received in revised form 9 December 2019; Accepted 9 December 2019; Available online 12 December 2019.
2590-1974/Crown Copyright © 2019 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).



With ever-increasing volumes of data, and initiatives such as the Seabed 2030 project (Nippon Foundation-GEBCO, 2018) aiming to "produce a definitive map of the world ocean floor by 2030", new computational techniques are required for the handling, processing and denoising of MBES point cloud data. In this paper we present an investigation into the application of deep learning Convolutional Neural Networks to the denoising of MBES point cloud data.

Machine learning (the science of teaching computers to recognise patterns in data), and deep learning specifically (a branch of machine learning using multi-layered artificial neural networks), have shown excellent performance in a number of predictive challenges (Le Cun et al., 1997; LeCun et al., 2015; Jain and Seung, 2009; Krizhevsky et al., 2012). Deep learning models are able to learn, from large pre-labelled datasets, the complex non-linear functions required to correctly classify new unseen data (LeCun et al., 2015). Inspired by studies on mammalian vision systems (Goodfellow et al., 2016), Convolutional Neural Networks (CNNs) have been central to recent advances in machine learning and have been shown to perform well on computer vision tasks ranging from object recognition (Krizhevsky et al., 2012; Gondara, 2016; He et al., 2015), semantic segmentation (Drozdzal et al., 2016; Ronneberger et al., 2015; Mao et al., 2016; Noh et al., 2015; Hong et al., 2015; Shelhamer et al., 2016; Long et al., 2014) and image denoising (Jain and Seung, 2009; Xie et al., 2012; Kim et al., 2017; Vincent et al., 2010; Mao et al., 2016) to super-resolution (Yang et al., 2017; Mao et al., 2016). Like other artificial neural network architectures, CNNs consist of a number of neuron layers that transform an input into a desired output (e.g., an image classification label or a segmented image mask). CNNs contain convolutional layers which apply transformational filters in a moving window across an image and, in doing so, learn to pick out features that are pertinent to the task they are being trained to perform. Trained against labelled data, a CNN learns through back propagation of error via stochastic gradient descent, or a variant thereof (LeCun et al., 2015).

CNNs have been applied to point cloud data in the context of object detection and classification in Light Detection and Ranging (Lidar) data (Prokhorov, 2010; Xu et al., 2018), including, for example, landing zone detection for airborne vehicles (Maturana and Scherer, 2015b). Maturana and Scherer (2015a) proposed a three-dimensional CNN architecture called VoxNet for fast and accurate classification of Lidar point cloud data, the 'voxel' being the 3D equivalent of a pixel. CNNs are a good option for 3D object classification mainly because the model makes explicit use of the spatial structure of the data. As the data is three dimensional, the kernel filters are cubes and the convolutions are calculated along the width, height and depth of the input feature map. With labelled training data, a three dimensional CNN trains with the same backpropagation of error as its 2D equivalent (a minimal illustration is given below).

This paper investigates whether three dimensional Convolutional Neural Networks can be trained for use in denoising MBES point cloud data. To our knowledge this is the first attempt at applying CNNs to the problem of denoising three dimensional MBES point cloud data.
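To make the extension from 2D to 3D convolutions concrete, the following minimal sketch (our illustration, not code from the paper, which used the R interface to Keras; it assumes TensorFlow is available) applies a single cubic kernel to one single-channel voxel chip of the dimensions used later in this paper:

```python
import numpy as np
import tensorflow as tf

# One 32 x 32 x 64 voxel chip with a single input channel:
# tensor layout is (batch, x, y, z, channels).
chip = np.random.rand(1, 32, 32, 64, 1).astype("float32")

# A cubic 3x3x3 kernel slides along the width, height and depth of the
# voxel grid; 'same' padding keeps the output the same size as the input.
conv = tf.keras.layers.Conv3D(filters=16, kernel_size=3, strides=1, padding="same")

features = conv(chip)
print(features.shape)  # (1, 32, 32, 64, 16): one feature map per filter
```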
In the context of this paper we do not distinguish between different types of error: we refer to all soundings not deemed to represent the seabed as 'noise'. The source of this noise is likely a combination of random measurement error, systematic bias and gross errors/blunders (e.g., fish schools, bubbles). The problem addressed in this paper is analogous to the semantic segmentation of images, where models are trained to make pixelwise class label predictions. We extend the two dimensional image segmentation problem to three dimensions, making binary accept/reject predictions on a voxel array. CNN model architectures that have demonstrated good performance on two dimensional images are adapted and tested. The models are trained using MBES point cloud data processed by bathymetry analysts using standard analytical techniques and expert judgement. To assess the models' performance for practical bathymetry data denoising, model predictions are then used to filter the point cloud data, which is subsequently rasterized to produce bathymetry DEMs (digital elevation models) to compare with a bathymetry surface derived from the expertly processed data.

2. Material and methods

2.1. Data

Four multibeam datasets were used in this study, collected in different locations of the Caribbean (Fig. 1) between 2016 and 2018 as part of the Commonwealth Marine Economies Programme (Gov.uk, 2016). The purpose of the surveys was to update Admiralty charts (UK Hydrographic Office, 2019) for navigational safety and for use in environmental seabed mapping. The Belize Buzzard Shoal data (referred to as 'Belize' from now on) is part of the HI1563 survey (UK Hydrographic Office, 2018), collected off the coast of Belize City. It covers approximately 1.3 km², with a total of 4.7 × 10⁷ soundings and a mean depth of 31 m. Seabed characterizations for the Belize dataset include sand, mud and coral. The other three datasets were collected as part of the HI1526 survey (UK Hydrographic Office, 2016) off St Vincent and the Grenadines. 'SVG-B1' was collected on the south coast of the island off Kingstown, covering ~14.3 km² with a mean water depth of 28 m. The 'SVG-B4' data is a section of the survey off the north west of St Vincent, with 8.8 × 10⁷ soundings covering 6.5 km² and a mean water depth of 47 m. 'SVG-B6' covers a ~2.9 km² area off the south east of the island and has 2.4 × 10⁸ soundings with a mean depth of 16 m. Seabed characterizations for the HI1526 surveys were sand, gravelly sand and mud. The MBES system used in all surveys was a Reson 7125 with 512 beams and an operating frequency of 400 kHz. Fig. 2 shows the distribution of depth soundings for each survey. A summary of the datasets is provided in Table 1.

Associated with each sounding was the latitude, longitude (WGS84) and water depth in positive meters (referenced to Chart Datum). The surveys were processed by bathymetry analysts in a semi-automated manner in CARIS HIPS software (v10.4.2; Teledynecaris, 2018), with each sounding labelled 'Accepted' to indicate that it correctly represented the seabed (or an object on the seabed), or 'Rejected' if it was considered noise. The 3D point cloud data was projected from WGS84 to UTM meters, based on the appropriate UTM zone for the survey location.

2.1.1. Voxels and chips

The point cloud data was discretised into 3D grid cells referred to as 'voxels' with a resolution of 1 m. This was considered a reasonable voxel resolution given the sounding density and depth (the vast majority of soundings were <100 m) of the surveys in the study. The count of soundings in each voxel, and whether any soundings were 'Accepted', was retained in the data array. In a similar approach to Maturana and Scherer (2015a), the full survey voxels were segmented into 3D cuboid arrays (which we refer to as 'chips') with dimensions 32 × 32 × 64 voxels, which provided the input features to the CNN. A sketch of this voxelization step is given below.
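As an illustration of this discretisation (our sketch, not the authors' code; it assumes soundings already projected to UTM metres and an `accepted` 0/1 flag per sounding), a voxel grid of sounding counts and voxel labels can be built with NumPy:

```python
import numpy as np

def voxelize(x, y, z, accepted, res=1.0):
    """Discretise soundings (UTM metres, positive depths) into res-metre
    voxels, keeping a per-voxel sounding count and a flag recording whether
    the voxel contains any 'Accepted' sounding."""
    ix = np.floor((x - x.min()) / res).astype(int)
    iy = np.floor((y - y.min()) / res).astype(int)
    iz = np.floor((z - z.min()) / res).astype(int)
    shape = (ix.max() + 1, iy.max() + 1, iz.max() + 1)
    counts = np.zeros(shape, dtype=np.int32)
    labels = np.zeros(shape, dtype=np.uint8)
    np.add.at(counts, (ix, iy, iz), 1)                         # soundings per voxel
    np.maximum.at(labels, (ix, iy, iz), accepted.astype(np.uint8))  # any 'Accepted'?
    return counts, labels
```

The full survey grid would then be cut into 32 × 32 × 64 chips as described above.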

Fig. 1. Locations of the MBES surveys used in this study: Belize and Saint Vincent and the Grenadines (SVG).


Fig. 2. Distribution of sounding depth by survey.


32 voxels were used as the basis for the 3D chipped input so that successive 2 × 2 × 2 downsampling in the model resulted in feature map outputs with valid sizes (even dimensions). As is common in the training of CNNs, the input data were scaled into the range 0–1. This was done using min(n/100, 1), where n is the number of soundings within each voxel. The value of 100, rather than the maximum value, was used for scaling because the distribution of soundings per voxel was highly skewed, with some voxels having large sounding densities on the order of several thousand soundings. Scaling by the maximum would have caused the majority of voxels to have values close to zero, which would have adversely affected training; the majority of voxels had <100 soundings (the 95th percentile of soundings per voxel is 84, see Fig. 4).

The vertical chip dimension (64 voxels) was chosen to be double the horizontal, to try and capture the complete vertical spread of depth data. The chips were vertically centered on the mean of the soundings within the chip and extended one horizontal chip size (i.e., 32 m) above and below the mean. Soundings outside the chip range were discarded, see Fig. 5. In Fig. 5, the chip mean depth is at 0 m (indicated by the yellow square); soundings outside the range [z̄ − 32, z̄ + 32] m (where z̄ is the mean chip depth) were removed. This means that chips did not always capture the entire vertical spread of the data, since in some areas the noise appeared far above or below z̄. While the majority of data discarded in this way was flagged as 'Rejected' in the training data, in some areas of steep slope good data ('Accepted' soundings) was sometimes lost. However, the percentage of chips with missing data was small (~4%); this is addressed further in the discussion.

The surveys used for training the model were focused on shallow coastal areas, with most soundings being less than 100 m deep (Fig. 3). To make a reasonable cutoff for training, only chips with z̄ < 75 m and N > 10 soundings were used to train the models. In summary, the inputs and outputs for the CNN were:

Input data: a 3D cuboid array with shape and depth data as illustrated by Fig. 5. There is one input channel, corresponding to the sounding density in each voxel scaled between 0 and 1. The training label is a binary class label for each voxel: 0 = voxel contains zero 'Accepted' soundings; 1 = voxel contains >0 'Accepted' soundings.

Output data: the trained model outputs a voxelwise binary classification; each voxel is assigned a probability corresponding to whether it contains 'Accepted' data. This is then classified into 'Accepted' or 'Rejected' using a threshold of 0.5.
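Putting the vertical windowing, density scaling and labelling together, a single chip could be built along the following lines (our sketch under stated assumptions: the function name `make_chip` is ours, and the x/y coordinates are assumed to already be chip-local, i.e. in the range 0–32 m):

```python
import numpy as np

def make_chip(xs, ys, zs, accepted, res=1.0, hsize=32, vsize=64):
    """Build one 32 x 32 x 64 chip from the soundings in a 32 m x 32 m column.
    The chip is centred vertically on the mean sounding depth and soundings
    outside [z_mean - 32, z_mean + 32] m are discarded."""
    z0 = zs.mean() - (vsize * res) / 2.0
    keep = (zs >= z0) & (zs < z0 + vsize * res)
    ix = np.floor(xs[keep] / res).astype(int)
    iy = np.floor(ys[keep] / res).astype(int)
    iz = np.floor((zs[keep] - z0) / res).astype(int)
    counts = np.zeros((hsize, hsize, vsize), dtype=np.float32)
    labels = np.zeros((hsize, hsize, vsize), dtype=np.float32)
    np.add.at(counts, (ix, iy, iz), 1)
    np.maximum.at(labels, (ix, iy, iz), accepted[keep].astype(np.float32))
    features = np.minimum(counts / 100.0, 1.0)      # min(n/100, 1) density scaling
    return features[..., None], labels[..., None]   # add the channel axis
```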

2.2. Experiments

The CNN architecture tested was based on a modified version of UNet, first used for bio-medical image segmentation (Ronneberger et al., 2015). It was adapted by making the convolutions three dimensional and reducing the number of convolutional layers in the network to reflect the smaller size of the input dimensions. The UNet is a symmetric architecture with contracting and expanding sections, sometimes referred to as an encoder-decoder CNN. The convolutional layers in the contracting section learn ('encode') features which are useful for the problem; the expanding section rebuilds or decodes the image, or in this case the 3D array, outputting voxelwise predictions. There were three units in the contracting path, each consisting of two 3D convolutions followed by the rectified linear activation function f(x) = max(0, x) and a subsequent max pooling layer. At each downsample (max pooling) the number of filter kernels is doubled (thereby doubling the number of feature channels into the next layer). Each unit in the expanding section consists of 3D up-sampling and three convolutional layers. Batch normalization was applied after each convolutional layer, before the activation function (Ioffe and Szegedy, 2015). Full details of the model are given in the supplementary materials section.

Convolutional kernel sizes of 3, 5 and 7 were tested; the largest kernel size was not possible for all models due to computer memory constraints. Different numbers of convolutional filters were also tested (16, 32 and 64). A stride of 1 for the filters was constant throughout all experiments. In order for the output features to have the same size as the inputs, zero-padding was applied for convolution and pooling layers (padding = 'same' mode in keras). The final layer used the sigmoid activation function for voxelwise binary classification:

$$f(x) = \frac{1}{1 + \exp(-x)} \qquad (1)$$
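The full architecture is given in the paper's supplementary materials; the following sketch (our Python/TensorFlow illustration; the authors used the R interface to Keras, and the exact layer counts and skip-connection placement here are assumptions based on the description above) shows a 3D U-Net of this shape, where `base_filters` and `kernel_size` correspond to the hyperparameters varied in the experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, n_convs, kernel_size=3):
    # Conv3D -> batch normalization -> ReLU, repeated n_convs times;
    # batch norm is applied before the activation, as in the paper.
    for _ in range(n_convs):
        x = layers.Conv3D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return x

def build_unet3d(input_shape=(32, 32, 64, 1), base_filters=32, kernel_size=3):
    inputs = layers.Input(input_shape)

    # Contracting path: two convolutions then 2x2x2 max pooling per unit;
    # the number of filter kernels doubles at each downsample.
    c1 = conv_block(inputs, base_filters, 2, kernel_size)
    p1 = layers.MaxPooling3D(2)(c1)
    c2 = conv_block(p1, base_filters * 2, 2, kernel_size)
    p2 = layers.MaxPooling3D(2)(c2)
    c3 = conv_block(p2, base_filters * 4, 2, kernel_size)
    p3 = layers.MaxPooling3D(2)(c3)

    b = conv_block(p3, base_filters * 8, 2, kernel_size)  # bottleneck

    # Expanding path: upsample, concatenate the skip connection, then three
    # convolutional layers per unit.
    u3 = layers.concatenate([layers.UpSampling3D(2)(b), c3])
    d3 = conv_block(u3, base_filters * 4, 3, kernel_size)
    u2 = layers.concatenate([layers.UpSampling3D(2)(d3), c2])
    d2 = conv_block(u2, base_filters * 2, 3, kernel_size)
    u1 = layers.concatenate([layers.UpSampling3D(2)(d2), c1])
    d1 = conv_block(u1, base_filters, 3, kernel_size)

    # Voxelwise binary classification via a 1x1x1 convolution with sigmoid.
    outputs = layers.Conv3D(1, 1, activation="sigmoid")(d1)
    return tf.keras.Model(inputs, outputs)

model = build_unet3d()
```

Note how the 32 × 32 × 64 input keeps even dimensions through all three 2 × 2 × 2 downsamples (down to 4 × 4 × 8 at the bottleneck), which is why 32 voxels were chosen as the horizontal chip size.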


Fig. 3. Raster surfaces generated from point cloud data (accepted data only). Pixel resolution is 2 m.

The batch size was fixed at 16 for all experiments due to GPU memory constraints. Models were trained using the RMSprop optimizer (Hinton et al., 2018), minimizing the binary cross-entropy loss function

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\right] \qquad (2)$$

where y is the binary class label and p is the predicted probability that the observation is in class 1. Twenty percent of the training set was sampled at random for use as a validation set during training. The learning rate at the start of training was set to 0.001 and was reduced by a factor of 10 if no improvement in the validation set loss was observed (patience of 4 epochs). All models were trained for a maximum of 100 epochs, with early stopping if no improvement in the validation set loss was observed (patience of 16 epochs). The data was augmented during training, as augmentation has been shown to be effective in preventing CNN models from over-fitting (Simard et al., 2003; Ciresan et al., 2010). A custom data generator function was used to augment the training data for each batch on the fly by rotating each chip a random amount around the z axis; a sketch of this training configuration is given at the end of this subsection.

The model was trained using the 'SVG-B4', 'SVG-B6' and 'Belize' surveys. To prevent the model over-fitting to a particular survey, an equal number of chips was taken from each survey for the training set, equal to the number of chips available in the smallest survey; the remaining chips were used as test sets (see Table 2 for details). The entire 'SVG-B1' dataset was set aside as an additional test set, as it was important to verify that the model was able to perform well in a completely new, unseen area.

A range of statistics was calculated to assess how well the models predicted the status label of each voxel in the test and validation sets: the accuracy, ACC = (TN + TP)/(TN + TP + FN + FP), where TN = true negatives, TP = true positives, FN = false negatives and FP = false positives; the false positive rate, FPR = FP/(FP + TN); the false negative rate, FNR = FN/(FN + TP); the kappa statistic (Cohen, 1960); the Jaccard index (Jaccard, 1912), also called intersection over union; and the binary cross-entropy. Multiple statistics were calculated to avoid relying on a single metric.
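As a hedged illustration of this training configuration (our Python/TensorFlow sketch reusing the `build_unet3d` model from the earlier block; the paper used the R Keras interface, and our generator rotates by random multiples of 90° per batch, whereas arbitrary-angle rotation would require resampling the voxel grid):

```python
import numpy as np
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="binary_crossentropy")

callbacks = [
    # Reduce the learning rate by a factor of 10 after 4 epochs without
    # improvement in validation loss; stop training entirely after 16.
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=4),
    tf.keras.callbacks.EarlyStopping(patience=16),
]

def augment(chips, labels, batch_size=16):
    """Yield batches with chips (and their labels) rotated around the z axis,
    i.e. in the x-y plane (axes 1 and 2 of a (N, x, y, z, 1) array)."""
    n = len(chips)
    while True:
        idx = np.random.permutation(n)
        for i in range(0, n - batch_size + 1, batch_size):
            batch = idx[i:i + batch_size]
            k = np.random.randint(4)  # random multiple of 90 degrees
            yield (np.rot90(chips[batch], k, axes=(1, 2)),
                   np.rot90(labels[batch], k, axes=(1, 2)))

# chips/labels: arrays of shape (N, 32, 32, 64, 1) built as sketched earlier.
# model.fit(augment(chips, labels), steps_per_epoch=len(chips) // 16,
#           epochs=100, validation_data=(val_chips, val_labels),
#           callbacks=callbacks)
```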



Fig. 4. Distribution of number of soundings per voxel by survey (non-zero voxels only).

Table 1
Survey summary.

Survey    Mean Depth (m)    Soundings (×10⁶)    Area (km²)
Belize    31                47                  1.3
SVG-B1    28                377                 14.3
SVG-B4    47                88                  6.5
SVG-B6    16                241                 2.9

Table 2
Training/test split between surveys. Note the entire SVG-B1 survey was used as a separate test set.

Survey    Total Chips    Training Set    Test Set
Belize    1312           1312            0
SVG-B4    2612           1312            1300
SVG-B6    2892           1312            1580
SVG-B1    6800           0               6800
Total     13616          3936            9680

Any one metric may be biased: accuracy, for example, is heavily affected by imbalanced class frequencies. The process of voxelisation and chipping results in very sparse data (on average, 97.5% of voxels contain zero soundings); as a result, performance statistics were only calculated on non-zero voxels (voxels with >0 soundings).

To test the performance of the model in a more intuitive sense, and to get an idea of how the model might perform when used in practice to produce bathymetry surfaces, the voxel predictions were used to filter the original point cloud data and generate bathymetry DEM rasters. To do this, all soundings were assigned the same predicted label as the voxel into which they fall, and only 'Accepted' soundings were then rasterized. In the rasterizing process each pixel is assigned the value of the shallowest sounding in that pixel which, while perhaps not providing the best representation of the surface, is typical in hydrographic mapping for navigational products. The root mean squared error (RMSE) and mean absolute error (MAE) were then computed to assess how the generated surfaces differ from the manually processed data.
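The voxelwise statistics restricted to non-zero voxels can be computed as follows (our sketch; the array names are illustrative, with `labels`, `probs` and `counts` as produced by the earlier sketches):

```python
import numpy as np

def voxel_metrics(labels, probs, counts, threshold=0.5):
    """Accuracy, FPR, FNR, Cohen's kappa and Jaccard index, computed on
    non-zero voxels only (voxels containing at least one sounding)."""
    mask = counts > 0
    t = labels[mask].astype(bool)        # 1 = voxel holds 'Accepted' soundings
    p = probs[mask] >= threshold         # thresholded prediction
    tp = np.sum(t & p);  tn = np.sum(~t & ~p)
    fp = np.sum(~t & p); fn = np.sum(t & ~p)
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    # Chance agreement for Cohen's kappa, from the row/column marginals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (acc - p_e) / (1 - p_e)
    jaccard = tp / (tp + fp + fn)        # intersection over union of 'Accepted'
    return {"acc": acc, "fpr": fpr, "fnr": fnr, "kappa": kappa, "jaccard": jaccard}
```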

Fig. 5. Schematic of a chip made up of 32 × 32 × 64 voxels. The chip (indicated by the blue lines) is centered on the mean of the sounding depths (yellow line). Soundings falling outside the vertical extent of the chip are discarded (shown in red). Soundings inside the chip (green) are attributed to voxels.



Table 3
Model comparison, sorted by test loss. Test set scores are for the SVG-B4/6 test set (including zero voxels). Model names are model-[number of convolution layers]-[number of filters]-[kernel size].

model         train_loss    val_loss    test_loss    test_kap
UNET-4-32-5   0.00165       0.00261     0.00171      0.984
UNET-4-32-3   0.00172       0.00193     0.00173      0.985
UNET-4-64-3   0.00169       0.00221     0.00173      0.985
UNET-4-64-5   0.00168       0.00234     0.00179      0.984
UNET-4-16-5   0.00182       0.00219     0.00180      0.984
UNET-4-16-3   0.00184       0.00203     0.00181      0.984
UNET-4-32-7   0.00187       0.00236     0.00225      0.982

Table 5
SVG-B1 test set confusion matrix (non-zero voxels only). Rows are the true class label, columns are the predicted label. The values in the class.error column are the False Positive Rate (FPR) and the False Negative Rate (FNR).

            Rejected    Accepted    class.error
Rejected    2368702     139499      0.056
Accepted    79134       6842592     0.011

The RMSE and MAE statistics were also calculated using the 'null' model (the raw, uncleaned data) to observe how much the error was reduced by the model. Experiments were carried out in R (R Development Core Team, 2018) using the Keras package (Allaire and Chollet, 2018). An NVIDIA Tesla V100 GPU with 16 GB RAM was used for training.

3. Results

A total of seven models were trained and tested. The best performing models are listed in Table 3. The numbers appended to the model name refer to the model hyperparameters: model-[number of convolution layers]-[number of filters in each layer]-[kernel size]. The results suggest that the best performing model had 32 convolutional filters and a kernel size of five voxels, although there was little to separate the top models. The training, validation and test set loss scores were broadly similar, which suggests that over-fitting to the training data was well controlled. Training of all models was stopped before 50 epochs, as the validation loss was no longer falling.

It is hard to get an intuitive sense of how the models performed from the binary cross-entropy loss scores alone; the kappa, accuracy and Jaccard coefficients provide additional metrics for measuring model performance. The performance metrics are strongly inversely correlated with the test loss, but not perfectly so; this is because the binary cross-entropy uses the probability of the class label and penalizes the magnitude of the error, whereas the kappa and Jaccard metrics do not.

Table 4 is the confusion matrix for the combined SVG-B4/6 test set chips and Table 5 shows the confusion matrix for the SVG-B1 test set. The test sets were assessed separately because, whereas the SVG-B4/6 data was split between training and test sets, the SVG-B1 survey was not used in training at all. They show very similar results in terms of false positive and negative rates and class errors, which is encouraging. Table 6 shows the accuracy, kappa score and Jaccard index calculated for the two test sets. The accuracy statistics for the SVG-B1 test set were very similar to those for the SVG-B4/6 test set which, considering that it is a much larger survey in a completely new area relative to the training set, is encouraging.

Fig. 6 shows the point cloud data for individual chips from the test set. The left-hand column shows the status label assigned by expert analysts; the right-hand column was assigned using the model predictions. Close inspection highlights that there are substantial amounts of 'Rejected' soundings interspersed among the surface of 'Accepted' soundings. This could be a result of 'over-cleaning' during the semi-manual processing of the training data. Examining the predictions, the model appears able to perform well with different patterns of noise, whether the noise is above or below the seabed, for more complex seabed types, and in areas where the seabed is highly variable. Rasters for each chip in the SVG-B4/6 test set were created by filtering

the point cloud data using the model's voxel predictions. The rasters corresponding to the chips shown in Fig. 6 are shown in Fig. 7. The observed vs predicted (expertly processed vs CNN prediction) rasters generally indicate good results when denoising the point cloud with the CNN model. RMSE and MAE statistics were calculated for every chip in the test set, and percentiles were calculated on these statistics across all chips. To provide insight into how much the raster error statistics were reduced by the model (i.e., how much noise the model was able to remove), error statistics for the uncleaned 'null' data were also calculated and compared with the model in Table 7. They show that 95% of the chips have a mean absolute error of less than 1.4 m, which is impressive considering that the voxel resolution is 1 m, so even a correct prediction can lead to an error of up to 1 m.
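The shallowest-sounding gridding and surface comparison described above can be sketched as follows (our illustration; the function names are ours, and both surfaces are assumed to share the same grid origin and extent):

```python
import numpy as np

def shallowest_raster(x, y, depth, res=2.0):
    """Grid a filtered point cloud: each pixel takes the shallowest (minimum)
    depth of the soundings that fall inside it, as is typical for
    navigational products."""
    ix = np.floor((x - x.min()) / res).astype(int)
    iy = np.floor((y - y.min()) / res).astype(int)
    grid = np.full((ix.max() + 1, iy.max() + 1), np.nan)
    np.fmin.at(grid, (ix, iy), depth)   # fmin ignores the NaN fill value
    return grid

def surface_errors(pred, ref):
    """RMSE and MAE between the model-filtered and manually cleaned surfaces,
    over pixels populated in both."""
    m = ~np.isnan(pred) & ~np.isnan(ref)
    d = pred[m] - ref[m]
    return np.sqrt(np.mean(d ** 2)), np.mean(np.abs(d))
```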

4. Discussion

The aim of this study was to test the effectiveness of training and applying Convolutional Neural Network models to the problem of denoising MBES point cloud data. The results reported from the hold-out test sets show promising performance, with kappa scores of 0.94 and accuracy of 0.977. The fact that error statistics for the SVG-B1 test set, which was in a completely new area to the training data, were very similar to those for the SVG-B4/B6 test set was particularly encouraging. This demonstrates the potential of a CNN, trained on sufficient data, to 'geo-generalize' well to unseen data in a new location. This is supported by the error statistics generated on the DEM surface rasters produced from the expertly processed vs CNN-denoised point cloud data. Overall, this study demonstrates the practical applicability of the CNN denoising methodology to the production of bathymetric products from raw point cloud data. Using trained CNN models could reduce the manual intervention required, as well as improving repeatability and reproducibility, in the processing of MBES survey data and the resulting publication of bathymetric products, contributing to safer navigation and more efficient economic development of the marine economy.

While the model trained here was able to perform well on unseen data collected in a similar region of the Caribbean, it was not the aim of the paper to train a comprehensive model which can broadly generalize to any type of seabed. A more comprehensive model would need to be trained on considerably more data, with more attention given to ensuring that all seabed types of interest were adequately represented. Further work is required to establish whether the approach works sufficiently well on more complex seabed features (e.g., rocky reefs, boulders, steep slopes) of which there were only limited examples in the surveys used here.

The nature of the voxel approach used here means that, even with a perfect voxel classification, there would still be a potential error of up to one voxel (1 m in this instance) in any given pixel of the DEM generated from the denoised data. This is a limitation of the voxelisation approach, but it remains a pragmatic way of dealing with large point-cloud datasets, as CNN models cannot handle irregular 3D data.

Table 4
SVG-B4/6 test set confusion matrix (non-zero voxels only). Rows are the true class label, columns are the predicted label. The values in the class.error column are the False Positive Rate (FPR) and the False Negative Rate (FNR).

            Rejected    Accepted    class.error
Rejected    1439596     63814       0.042
Accepted    44526       3137854     0.014

Table 6
Test set error statistics (calculated on non-zero voxels only).

Test set    Accuracy    Kappa    J-index
SVG-B4/6    0.977       0.947    0.967
SVG-B1      0.977       0.940    0.969



However, in a productionized system this limitation could be overcome by creating multiple chips at varying vertical offsets, so that each sounding receives multiple predictions. More generally, a multi-resolution approach, with multiple predictions made at different spatial scales, would likely be beneficial in practice.

The U-Net model architecture was chosen in this study as a basis for experimentation because of its success in a number of semantic segmentation problems (Ronneberger et al., 2015; Yao et al., 2018). However, there are numerous CNN model architectures which have been used successfully for semantic segmentation of images (Noh et al., 2015; Hong et al., 2015; Shelhamer et al., 2016; Long et al., 2014) that could also be extended to 3D problems, and these architectures could provide improved performance. The UNet model uses max pooling in the encoder and up-sampling in the decoder sections of the model. The process of max pooling is useful in classification because it filters out noisy activations; however, in segmentation-type problems fine scale spatial detail can be lost (Noh et al., 2015). Noh et al. (2015) use a symmetric convolutional-deconvolutional architecture for semantic segmentation, and a similar approach is used for image denoising by Mao et al. (2016). One of the key characteristics of this alternative architecture is the use of deconvolutions rather than up-sampling for decoding the image (deconvolutions can be thought of as 'learnable up-sampling'). Deconvolutions (also referred to as transposed convolutions) apply a transformation in the opposite direction to convolutions; the dimensions of input and output feature maps for deconvolutions are the inverse of those for convolutions (Dumoulin and Visin, 2016). It would be worth testing whether this type of model architecture could further improve the performance reported here.

The adapted UNet model used in this paper, as well as the model described by Mao et al. (2016), used skip connections between encoder and decoder components. The benefits of adding skip connections for various machine vision tasks have been highlighted by Mao et al. (2016), Drozdzal et al. (2016) and Ronneberger et al. (2015). Adding these connections between earlier and later layers has been shown to be useful in deep networks: in image denoising they allow spatial detail to be passed to later layers, which makes training faster and more effective (Mao et al., 2016); they aid convergence in very deep networks (Drozdzal et al., 2016; Szegedy et al., 2016); and they help recover the full spatial resolution in image segmentation (Ronneberger et al., 2015).

Many of the false positives and false negatives fall adjacent to the seabed. The approach used by Ronneberger et al. (2015) for their original UNet was to compute a weight map for pixels, used to calculate a weighted loss function with higher weighting given to pixels at the boundaries of segments in the image. A similar approach could be used to improve the model here, with higher weighting applied to voxels bordering the seabed. This would force the model to 'focus' on the seabed while training; a sketch of such a loss is given below.
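One hedged way such a boundary-weighted loss could be implemented (our sketch, not something tested in the paper: the weight value is arbitrary, and the 'boundary' is approximated by dilating the label mask with 3D max pooling):

```python
import tensorflow as tf

def boundary_weighted_bce(w_boundary=5.0):
    """Binary cross-entropy with extra weight on voxels adjacent to
    'Accepted' voxels, in the spirit of the pixel weight maps of
    Ronneberger et al. (2015). Assumes y_true is a float (b, x, y, z, 1)
    tensor of 0/1 labels."""
    def loss(y_true, y_pred):
        # Dilate the label mask by one voxel; the difference is the shell
        # of 'Rejected' voxels immediately bordering the seabed.
        dilated = tf.nn.max_pool3d(y_true, ksize=3, strides=1, padding="SAME")
        boundary = dilated - y_true
        weights = 1.0 + w_boundary * boundary
        bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)[..., None]
        return tf.reduce_mean(weights * bce)
    return loss

# model.compile(optimizer="rmsprop", loss=boundary_weighted_bce())
```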

Fig. 6. The raw point cloud data for a selection of test set chips. The left column is the observed label (blue is accepted, red is rejected); the right column is the label predicted by the model.

A limited amount of data augmentation was carried out in this study, in the form of rotating chips at random around the z-axis. When putting the model into a productionized environment it would be desirable to carry out more extensive data augmentation, including mirroring and vertical shifting of the data. This should reduce the possibility of over-fitting and increase the capacity of the model to generalize to new areas of seabed.

The method used here for scaling the sounding densities (dividing by 100 and then clipping to 1) was thought to be a reasonable choice based on an inspection of the data; however, it would not be appropriate in all circumstances. The data density present in any MBES survey depends on sounding frequency, water depth, vessel speed and attitude, and the number of passes the vessel makes over each location. Assuming a constant vessel speed and a consistent amount of overlap between survey lines (it is common practice for surveys to have a significant amount of overlap, i.e., any point on the seabed should be surveyed by at least two transects), density will mainly be a function of water depth. A more general way of scaling the sounding density, perhaps incorporating water depth, would be an improvement to the method reported here. Maturana and Scherer (2015a) use a probabilistic formulation of an occupancy grid to represent the density of Lidar 'hits' in each voxel. This is based on the fact that different objects are likely to be more or less porous to Lidar beams.


Fig. 7. Raster surfaces generated from the same test chips as the previous figure. Rasters are 32 × 32 pixels at 1 m resolution. The first column is the raster generated using all the data, the second column the observed (manually cleaned) surface, and the third column the surface generated using the model predictions.

This technique requires a realistic estimate of the number of beams passing through each cell. A similar approach could be applied to MBES data given an appropriate model of the beam geometry.

A potential drawback of the approach taken here is that it requires extensive parameter tuning over large sets of labelled data to train the model. Other methods, such as filtering soundings based on their distance from a fitted 2D surface, may give reasonable results on limited seabed types without the need for parameter tuning. However, the advantage of the deep learning CNN approach demonstrated here is that, once a model is trained on sufficient data, it has the potential to consistently filter spurious soundings with high accuracy on diverse types of seabed, requiring little intervention by an analyst, and it would increase the speed at which bathymetric products could be produced from survey data.

Table 7
SVG-B4/6 test set raster statistics. 'null' indicates the uncleaned data.

Statistic     null     model    %reduction
rmse95        30.97    3.46     88.8
mae95         24.70    1.40     94.3
rmseMedian    4.48     0.11     97.5
maeMedian     1.29     0.02     98.4

Grid-based depth estimation algorithms such as CUBE (Calder and Mayer, 2003) and CHRT (Calder and Rice, 2017) are tools currently used in the processing of MBES data. The CUBE algorithm works on a grid basis and produces a number of depth hypotheses for each grid node. These algorithms do not filter data, but rather attempt to identify the most likely depth at each grid node. Once a number of depth hypotheses have been generated, there are several ways to select which is most likely to be correct: a simple approach is to choose the one supported by the most data; another is to compare the hypotheses with those in a local neighborhood. For CUBE to work optimally it requires the removal of systematic errors and outliers (Arge et al., 2010). The method we outline here could potentially be used as a prior processing step to remove the majority of noise before a surface is generated by a depth estimation algorithm such as CUBE/CHRT.

The chipping method used here centered the chip in the z direction on the mean of the soundings within the spatial extent of the chip. As the chips have a vertical range of 64 m, in some cases data is lost because it falls outside this range. This is not an issue if the lost data is noise,


but sometimes, in areas of steep slopes, 'Accepted' soundings were also lost. In the three surveys used here approximately 4% of the total chips missed 'Accepted' data, with all of these chips coming from the SVG-B4 survey. In its current form the method is not appropriate for areas of steep slopes. We think there are various ways of overcoming this problem, such as anchoring the chips to a specific depth rather than centering on the chip mean depth, and training a depth-specific model. While this is an issue that needs to be addressed if this method is used in 'production', it does not invalidate the main aim of the study, which is to test the efficacy of CNN models for the denoising of MBES data.

5. Conclusion

This paper demonstrates the potential for Convolutional Neural Networks to be applied to the denoising of MBES point cloud data. A CNN trained on expertly processed MBES data is able to process new unseen survey data with a high level of accuracy. Applying a trained CNN in a productionized MBES processing pipeline could reduce the requirement for manual intervention while improving reproducibility and repeatability when creating bathymetric data products from raw MBES data. The application of state-of-the-art machine learning algorithms to the marine data sector has the potential to improve the currency, accuracy, coverage and resolution of geospatial data in the marine environment, facilitating improved navigational safety, environmental protection and the sustainable economic development of the global marine economy.

Declarations of interest

None.

Authorship statements

David Stephens contributed towards the conception and design of the study, analysis and interpretation of data, drafting the manuscript and revising the manuscript critically for important intellectual content. Andrew Smith contributed towards the conception and design of the study, analysis and interpretation of data, drafting the manuscript and revising the manuscript critically for important intellectual content. Dr Thomas Redfern contributed towards the conception and design of the study, analysis and interpretation of data, drafting the manuscript and revising the manuscript critically for important intellectual content. Andrew Talbot contributed towards the conception and design of the study, analysis and interpretation of data, drafting the manuscript and revising the manuscript critically for important intellectual content. Andrew Lessnoff contributed towards the analysis and interpretation of data and drafting the manuscript. Dr Kari Dempsey contributed towards the conception and design of the study, analysis and interpretation of data, drafting the manuscript and revising the manuscript critically for important intellectual content.

Acknowledgements

We are grateful to Brian Calder and Kim Lowell at The Center for Coastal and Ocean Mapping/Joint Hydrographic Center for providing very helpful feedback on the manuscript. Thanks to Mike Hudgell for providing comments on the manuscript and to Catherine Seale, who coined the phrase 'geo-generalize'. The data used in this study was collected under the Commonwealth Marine Economies Programme (Gov.uk, 2016). Thank you to the two anonymous reviewers for taking the time to read and understand the manuscript and provide constructive comments which helped to improve the paper.

Appendix A. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.acags.2019.100016.

References

Allaire, J.J., Chollet, François, 2018. Keras: R interface to 'Keras'. https://cran.r-project.org/package=keras.
Arge, Lars, Larsen, Kasper Green, Mølhave, Thomas, van Walderveen, Freek, 2010. Cleaning massive sonar point clouds. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS '10), 152. https://doi.org/10.1145/1869790.1869815.
Calder, B.R., Mayer, L.A., 2003. Automatic processing of high-rate, high-density multibeam echosounder data. Geochem. Geophys. Geosyst. 4 (6). https://doi.org/10.1029/2002GC000486.
Calder, B.R., Rice, G., 2017. Computationally efficient variable resolution depth estimation. Comput. Geosci. 106, 49–59. https://doi.org/10.1016/j.cageo.2017.05.013.
Ciresan, Dan Claudiu, Meier, Ueli, Gambardella, Luca Maria, Schmidhuber, Jurgen, 2010. Deep big simple neural nets excel on handwritten digit recognition. arXiv, 1–14.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1), 37–46.
Dierssen, Heidi M., Theberge, Albert E., 2016. Bathymetry: seafloor mapping history. In: Encyclopedia of Natural Resources: Water, vol. 2. CRC Press, pp. 644–648. https://doi.org/10.1081/E-ENRW-120047531.
Drozdzal, Michal, Vorontsov, Eugene, Chartrand, Gabriel, Kadoury, Samuel, Pal, Chris, 2016. The importance of skip connections in biomedical image segmentation. arXiv. https://doi.org/10.1007/978-3-319-46976-8.
Dumoulin, Vincent, Visin, Francesco, 2016. A guide to convolution arithmetic for deep learning. arXiv, 1–28.
Gondara, Lovedeep, 2016. Medical image denoising using convolutional denoising autoencoders. arXiv. https://doi.org/10.1109/ICDMW.2016.0041.
Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, 2016. Convolutional Networks. MIT Press.
Gov.uk, 2016. Commonwealth Marine Economies Programme. https://www.gov.uk/guidance/commonwealth-marine-economies-programme.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2015. Deep residual learning for image recognition. arXiv. http://arxiv.org/abs/1512.03385.
Hinton, Geoffrey, Srivastava, Nitish, Swersky, Kevin. Neural Networks for Machine Learning. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
Hong, S., Noh, H., Han, B., 2015. Decoupled deep neural network for semi-supervised semantic segmentation. arXiv, 1506.04924, 1–9.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv, 1502.03167.
Jaccard, Paul, 1912. The distribution of the flora in the alpine zone. New Phytol. 11 (2).
Jain, Viren, Seung, Sebastian, 2009. Natural image denoising with convolutional networks. In: Advances in Neural Information Processing Systems 21 (NIPS 2008). http://papers.nips.cc/paper/3506-natural-image-denoising-with-convolutional-networks.pdf.
Kim, Juhwan, Song, Seokyong, Yu, Son-Cheol, 2017. Denoising auto-encoder based image enhancement for high resolution sonar image. In: 2017 IEEE OES International Symposium on Underwater Technology (UT 2017). https://doi.org/10.1109/UT.2017.7890316.
Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey, 2012. ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems (NIPS). http://arxiv.org/abs/1102.0183.
Le Cun, Yann, Bottou, L., Bengio, Yoshua, 1997. Reading checks with multilayer graph transformer networks. In: 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 151–154. IEEE Comput. Soc. Press. https://doi.org/10.1109/ICASSP.1997.599580.
LeCun, Yann, Bengio, Yoshua, Hinton, Geoffrey, 2015. Deep learning. Nature 521, 436–444.
Long, Jonathan, Shelhamer, Evan, Darrell, Trevor, 2014. Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2015.7298965.
Mao, Xiao-Jiao, Shen, Chunhua, Yang, Yu-Bin, 2016. Image restoration using convolutional auto-encoders with symmetric skip connections. arXiv, 1–17.
Maturana, Daniel, Scherer, Sebastian, 2015a. VoxNet: a 3D convolutional neural network for real-time object recognition. In: IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. https://doi.org/10.1109/IROS.2015.7353481.
Maturana, Daniel, Scherer, Sebastian, 2015b. 3D convolutional neural networks for landing zone detection from LiDAR. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 3471–3478. https://doi.org/10.1109/ICRA.2015.7139679.
Nippon Foundation-GEBCO, 2018. The Nippon Foundation-GEBCO Seabed 2030 Project. https://seabed2030.gebco.net/.
Noh, Hyeonwoo, Hong, Seunghoon, Han, Bohyung, 2015. Learning deconvolution network for semantic segmentation. arXiv.
Prokhorov, Danil, 2010. A convolutional learning system for object classification in 3-D lidar data. IEEE Trans. Neural Netw. 21 (5), 858–863. https://doi.org/10.1109/TNN.2010.2044802.
R Development Core Team, 2018. R: a language and environment for statistical computing. R Foundation for Statistical Computing.
Ronneberger, Olaf, Fischer, Philipp, Brox, Thomas, 2015. U-net: convolutional networks for biomedical image segmentation. arXiv, 1–8.

Shelhamer, Evan, Long, Jonathan, Darrell, Trevor, 2016. Fully convolutional networks for semantic segmentation. arXiv, 1–12. http://arxiv.org/abs/1605.06211.
Simard, Patrice Y., Steinkraus, Dave, Platt, John C., 2003. Best practices for convolutional neural networks applied to visual document analysis. In: Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), pp. 1–6.
Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, Alemi, Alex, 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Pattern Recognit. Lett. 42, 11–24. https://doi.org/10.1016/j.patrec.2014.01.008.
Teledynecaris, 2018. HIPS and SIPS. https://www.teledynecaris.com/en/products/hips-and-sips/.
UK Hydrographic Office, 2016. HI1526 Report of Survey.
UK Hydrographic Office, 2018. HI1563 Report of Survey.
UK Hydrographic Office, 2019. ADMIRALTY Charts. https://www.admiralty.co.uk/charts.
Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, Manzagol, Pierre-Antoine, 2010. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408.
Xie, Junyuan, Xu, Linli, Chen, Enhong, 2012. Image denoising and inpainting with deep neural networks. In: NIPS, 1–9. https://nips.cc/Conferences/2012/Program/event.php?ID=3279.
Xu, Zewei, Guan, Kaiyu, Casler, Nathan, Peng, Bin, Wang, Shaowen, 2018. A 3D convolutional neural network method for land cover classification using LiDAR and multi-temporal Landsat imagery. ISPRS J. Photogrammetry Remote Sens. 144, 423–434. https://doi.org/10.1016/j.isprsjprs.2018.08.005.
Yang, Fan, Xu, Wei, Tian, Yapeng, 2017. Image super resolution using deep convolutional network based on topology aggregation structure. https://doi.org/10.1063/1.4993002.
Yao, Wei, Zeng, Zhigang, Lian, Cheng, Tang, Huiming, 2018. Pixel-wise regression using U-Net and its application on pansharpening. Neurocomputing 27, 364–371.
