Ambitions of the Project

This section provides an overview of the ambitions of the DeeperSense project for each use case and algorithm.

Copyright: DFKI, Meltem Fischer

SoA Inter-Sensoric Learning

Fusing different sensor modalities is a standard approach in robotics, generally used for detecting objects, estimating environmental parameters, mapping, and localizing robotic systems within their environments. Classically, the integration of data from different modalities relied mostly on statistical frameworks such as the Kalman filter and its variants, or on geometric and topological methods (Kam, Zhu, & Kalata, 1997). Sensor fusion has also been an important aspect of underwater vehicle navigation; (Nicosevici, Garcia, Carreras, & Villanueva, 2004) provide a comprehensive review of fusion techniques for underwater applications. With a focus on seafloor mapping, (Singh, Roman, Whitcomb, & Yoerger, 2000) and (Kunz, 2012) present optical and acoustic data fusion techniques for merging photomosaics and multibeam bathymetric maps. (Negahdaripour, 2007) combined the features of a forward-looking sonar and an optical camera by deriving the constraint equations for the epipolar geometry of a pinhole camera model and stereo triangulation. A comprehensive overview of conventional techniques for fusing optical and acoustic sensors is given in (Ferreira, Machado, Ferri, Dugelay, & Potter, 2016).

In this work, however, we are interested in an unconventional technique for fusing different sensor modalities, namely translating the features of one modality into another. This concept is inspired by research fields such as automatic language translation and image-to-image translation. In (Isola, Zhu, Zhou, & Efros, 2017), a pixel-to-pixel mapping approach was presented and shown to be capable of generating images from label maps, reconstructing objects from edge maps, and generating colored images from greyscale images. A method known as WaterGAN was presented by (Li, Skinner, Eustice, & Johnson-Roberson, 2017) for correcting colors in underwater images.

Recent advancements in deep learning have also made their way into the realm of acoustics. With BatVision, (Haahr Christensen, Hornauer, & Yu, 2019) describe a system that mimics the echolocation used by bats. BatVision uses the reflections of acoustic chirps emitted by a conventional loudspeaker to create surprisingly accurate 3-D representations (images) of an office environment. The system learns to interpret the acoustic sound patterns from video feeds and optically generated 3-D data, which are recorded in parallel and used to train an artificial neural network. (Haahr Christensen et al., 2019) were able to show that the trained network can interpret sound patterns recorded in another section of the office building as meaningful and surprisingly accurate representations of the floor layout.

Terayama, Shin, Mizuno, & Tsuda (2019) used a combination of an imaging sonar and an optical camera for fish monitoring in aquaculture farms at night. By training a GAN model with optical and acoustic data collected in daylight, they were able to generate visual images of the fishpond during the night. To match features between optical and acoustic images, (Jang, Kim, Lee, & Kim, 2019) used a CNN-based approach, as opposed to the GAN used in the two previous works. In this method, the features of the sonar image are extracted using a pretrained VGG19 model (Simonyan & Zisserman, 2014), taking the deepest layer of the CNN in order to retain semantic information only. A generation step then minimizes the content and style loss between the generated image and the optical image, thus transferring the style of the optical image to the sonar one.
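The content and style losses mentioned above can be sketched as follows. This is a minimal, framework-free illustration of the commonly used Gram-matrix formulation of style transfer, not the implementation of (Jang et al., 2019): the feature maps are plain nested lists standing in for CNN activations, and the names `gram_matrix`, `total_loss`, `alpha` and `beta` are assumptions introduced here. Treating the sonar features as the content and the optical features as the style is our reading of the description above.

```python
def gram_matrix(features):
    """Channel-by-channel correlations; features is a list of C channels,
    each a flat list of N spatial activations."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def mse(xs, ys):
    """Mean squared difference between two equally shaped nested lists."""
    flat_x = [v for row in xs for v in row]
    flat_y = [v for row in ys for v in row]
    return sum((a - b) ** 2 for a, b in zip(flat_x, flat_y)) / len(flat_x)


def content_loss(generated, content):
    # Penalizes differences in the activations themselves (what is where).
    return mse(generated, content)


def style_loss(generated, style):
    # Penalizes differences in channel statistics, ignoring spatial layout.
    return mse(gram_matrix(generated), gram_matrix(style))


def total_loss(generated, sonar_feats, optical_feats, alpha=1.0, beta=1.0):
    # Assumed weighting: alpha for content (sonar), beta for style (optical).
    return (alpha * content_loss(generated, sonar_feats)
            + beta * style_loss(generated, optical_feats))
```

Because the Gram matrix discards spatial positions, two feature maps with the same activations in a different spatial order have zero style loss but a non-zero content loss. In a full pipeline, the generated image would be updated by gradient descent on `total_loss`, which requires an automatic differentiation framework.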

In the works mentioned above, the authors demonstrate the capabilities of their methods only on very specific applications, as the collected data is limited to the studied environments. In DeeperSense, we plan to investigate the concept of inter-sensoric learning further in order to tackle a wider range of underwater environments and applications, and to provide a more general system that is able to interpret a variety of scenes and objects.


DeeperSense will develop an acoustic-to-visual system (Sound2Vision) that can translate low-resolution sonar data into representations and visualizations of the environment that surrounds an underwater robot. These visualizations will be more accurate than the original sonar images and easier to interpret for both human operators and autonomous robotic systems.

SoA Obstacle Avoidance

Obstacle avoidance (OA) in autonomous underwater vehicles (AUVs) is typically performed using acoustic sensors, such as multibeam echosounders, side scan sonars or synthetic aperture sonars. As the spatial resolution of these sensors is limited, AUV missions are usually performed in areas where the bathymetry is known a priori and a safe altitude above the seabed (roughly 10 m to 50 m) can be kept. In such missions, the main hazard to the vehicle comes from unexpected obstacles on the seabed or drifting objects in the water column. Consequently, the OA schemes are typically rudimentary.

The most basic OA systems for AUVs rely on altitude data fused with data from an (acoustic) narrow-beam altimeter located in the AUV’s nose and aimed at roughly 45° to the horizon, to provide a fast escape reaction in case of an unexpected obstacle in front of the vehicle.

More advanced OA systems also employ a forward-looking sonar (FLS). This acoustic sensor is capable of providing a 2D horizontal or vertical image, depending on its positioning. The image contains range information, but only along one of the spatial axes. Its resolution is limited, and it is not useful at close range (below 5 m) because of reverberations. Although it is common practice to use the FLS for OA, the path planning that is possible with this sensor data is very limited. Usually, it is only possible to instruct the vehicle to ascend when the FLS senses an obstacle ahead, which is not much better than the altimeter-based decisions.
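The rudimentary "ascend when the FLS senses an obstacle ahead" behaviour described above can be sketched in a few lines. This is an illustrative assumption, not the logic of any specific vehicle: the function name, the command strings and the 20 m safety range are hypothetical, and only the 5 m close-range limit is taken from the text.

```python
def reactive_command(fls_ranges_m, safety_range_m=20.0, min_valid_range_m=5.0):
    """Return 'ascend' if a usable FLS return is closer than the safety range.

    fls_ranges_m      -- ranges (in metres) of the returns detected ahead
    safety_range_m    -- hypothetical distance at which the vehicle reacts
    min_valid_range_m -- returns below this range are treated as reverberation
    """
    # Discard close-range returns, which the sonar cannot resolve reliably.
    valid = [r for r in fls_ranges_m if r >= min_valid_range_m]
    if valid and min(valid) < safety_range_m:
        return "ascend"
    return "continue"


print(reactive_command([40.0, 12.0, 60.0]))  # obstacle at 12 m triggers the escape
print(reactive_command([2.0, 50.0]))         # the 2 m return is ignored as unreliable
```

The second call shows the blind zone at close range: a rule of this kind offers a single escape manoeuvre and no path planning around the obstacle.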

Obviously, the currently available solutions are unsatisfactory for missions where the AUV is required to operate in a complex and dense environment, such as a coral reef.

To obtain 3D acoustic images that would support more sophisticated OA schemes, some solutions use two FLSs installed perpendicularly in the AUV nose (Braginsky & Guterman, 2016; Horner, McChesney, Masek, & Kragelund, 2009). However, those solutions are cumbersome and expensive, and do not overcome the inherent drawbacks of acoustic images, i.e. low resolution and limited use for close-range operations.

In the terrestrial domain, the fusion of sensor data from radars and cameras has proved successful for OA in autonomous cars and unmanned surface vehicles (Hermann, Galeazzi, Andersen, & Blanke, 2015). In the underwater domain, there is therefore a growing interest in transferring this concept by using optical and acoustic sensors (i.e., sonars, whose working principle is similar to that of radars) simultaneously (Ferreira et al., 2016). High-resolution multibeam data (obtained from downward-looking scans) has been successfully merged with optical data (Babaee & Negahdaripour, 2015; Johnson-Roberson, Pizarro, & Williams, 2009). An FLS, however, has a lower resolution, and it is therefore significantly more difficult to combine this sensor with cameras. In (Cotter, Matzner, Horne, Murphy, & Polagye, 2016), an FLS was used in conjunction with cameras, but the data was not fused. Only recently was a first attempt presented to fuse sonar and optical data to localize objects in front of an AUV (Raaj, John, & Jin, 2016), however without any attempt to enhance the visual image.

Fusing multiple images from different sensor modalities to provide better processing and/or to improve image quality is a challenging task. Nevertheless, (Drozdov, Shapiro, & Gilboa, 2016) showed that low-quality (w.r.t. resolution and accuracy) 3D geometry data can be greatly enhanced by fusing optical side-information.


With EagleEye, we will use Machine Learning and Deep Neural Networks to fuse data from long-range low-resolution acoustic sensors such as Forward-Looking Sonars (FLS), with short-range high-resolution optical data from Forward-Looking Cameras (FLC). EagleEye will significantly enhance the capabilities of FLCs, and Obstacle Avoidance (OA) systems based on FLCs, in that it empowers them to perceive obstacles even under bad visibility conditions or when the obstacles are still far away.

Seabed Mapping and Classification

Seabed mapping and classification aims at distinguishing the marine benthic habitat characteristics of the surveyed area, such as hard or soft bottom, the level of roughness or smoothness, or the predominant seafloor type (mud, sand, clay or cobble, among several others) (Bellec et al., 2017; Valentine, Todd, & Kostylev, 2005). The most widely used seabed mapping approaches rely on acoustic sensors: sidescan sonars, multi-beam echo-sounders and acoustic ground discrimination systems (Kenny et al., 2003). To a lesser extent, other sensors are also used, such as optical cameras and seabed samplers.

In typical operations, scientists and surveyors use seabed samplers to create ground-truth data for a specific zone of the seabed. The obtained knowledge is transferred to video surveys recorded with towed video cameras or with camera-equipped AUVs/ROVs, thus yielding a map with the occurrence of each sediment type. This process is possible because an expert is able to detect images containing similar samples and distinguish them from others (the class obtained with the samplers is transferred to the closest images). The main limiting factor in this procedure is that optical imaging of the seabed is restricted to areas of shallow and clear water, as electromagnetic waves in the visible light spectrum (400–700 nm) are quickly attenuated by seawater. Acoustic methods have therefore become the main tool for seabed mapping. Their drawback is that the different classes are more difficult to distinguish from acoustic information alone. Typically, ground-truth information collected with various seabed samplers and underwater imaging techniques is used to create highly detailed maps (Diesing, Mitchell, & Stephens, 2016).

Optical imagery was utilized at a few sites to assess the potential of supervised classification to support the detection and characterization of underwater infrastructure such as unexploded ordnance (UXO) and benthic habitat. The main strength of underwater images of the seabed is their high spatial resolution relative to acoustic or magnetic/electromagnetic (EM) field methods. Underwater images significantly enhance situational awareness of the seabed and can be geo-registered and correlated with anomalies identified in EM or sonar maps (Shihavuddin et al., 2014; Shihavuddin, Gracias, Garcia, Gleason, & Gintert, 2013).

(Rimavicius & Gelzinis, 2017) show that deep learning methods are suitable for seabed image classification. The work reports the performance of different techniques in classifying images of five benthic classes: “Red algae”, “Sponge”, “Sand”, “Lithothamnium” and “Kelp”. In (King, Bhandarkar, & Hopkinson, 2018), deep learning methods for semantic segmentation, i.e., patch-based convolutional neural network (CNN) approaches and fully convolutional neural network (FCNN) models, are compared in the context of classifying regions in underwater images of coral reef ecosystems into biologically meaningful categories.

Acoustic sensors have the advantage of offering a wider field of view and being less affected by attenuation, but the data they produce is more difficult for experts to interpret directly and needs some processing. In recent years, advancements in deep learning have made it possible to process this data more efficiently. For example, (Yan, Meng, & Zhao, 2020) propose a 1D-CNN for bottom tracking with side-scan sonar, and (Berthold, Leichter, Rosenhahn, Berkhahn, & Valerius, 2017) propose a model for the automatic classification of sediment types in side-scan sonar data, which is based on a patch-wise classification using ensemble voting. While the prediction of sand achieves a good accuracy in that work, the accuracy for fine sediment is very poor.

(Luo et al., 2019) show that a CNN classifier can be applied to the classification of sediments based on a small seabed acoustic image dataset, and found that a shallow CNN performs better than a deep CNN on existing side-scan sonar data. In particular, the accuracy obtained in several sediment classification experiments using a shallow CNN classifier ranged between 93.4% (Sand Wave) and 87.54% (Reef).
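The idea of patch-wise classification with ensemble voting can be illustrated with a short sketch. It assumes, for illustration only, that several classifiers have each assigned a label to every sonar image patch and that the final label is chosen by simple majority; the function names and example labels are hypothetical and do not reproduce the method of (Berthold et al., 2017).

```python
from collections import Counter


def ensemble_vote(predictions):
    """Return the majority label among the ensemble members' predictions.

    Ties are resolved in favour of the label that was encountered first.
    """
    return Counter(predictions).most_common(1)[0][0]


def classify_patches(patch_votes):
    """Apply majority voting independently to every patch of a sonar image."""
    return [ensemble_vote(votes) for votes in patch_votes]


# Hypothetical labels from three classifiers for two patches.
labels = classify_patches([
    ["sand", "sand", "fine sediment"],
    ["fine sediment", "sand", "fine sediment"],
])
print(labels)
```

Voting makes the per-patch decision robust to occasional errors of individual classifiers, although it cannot help when most ensemble members confuse the same classes.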


While AUV-based acoustic mapping has evolved dramatically, there is still a clear need to move the ability to interpret those maps from a post-mission expert-based analysis to an on-mission machine-learning capability. This will allow AUV mapping missions to be much more efficient. SmartSeafloorScan in DeeperSense aims at being a landmark in this topic. To the best of our knowledge, it will be the first attempt to incorporate a deep-learning approach running on acoustic data in real-time on an AUV for seabed classification.

SoA ML and Deep Learning in Robotics

Machine learning in robotics has gained wide popularity in fields such as inverse kinematics and dynamics, manipulation, autonomous navigation and locomotion (Herbort, Butz, & Pedersen, 2010; Kalakrishnan, Buchli, Pastor, & Schaal, 2009; Nguyen-Tuong, Seeger, & Peters, 2009; Plagemann et al., 2008; Schaal, Atkeson, & Vijayakumar, 2002; Thrun, 1998). A wide variety of learning methods for different applications has been investigated in the literature. To name a few, Locally Weighted Projection Regression (LWPR) was used in (Schaal et al., 2002) to learn the inverse dynamics of a humanoid robot. Support Vector Regression (SVR) was used to learn the control of a biped robot in (Martins et al., 2007). In (Plagemann et al., 2008), Gaussian Process Regression (GPR) was used to learn terrain models for legged robot locomotion. The inverse dynamics of manipulator arms were modeled using local GPR in (Nguyen-Tuong et al., 2009) and using an LSTM network in (Rueckert, Nakatenus, Tosatto, & Peters, 2017). A deep ReLU network was used to model the dynamics of a helicopter in (Punjani & Abbeel, 2015).

In the field of marine robotics, the application of machine learning is still relatively scarce. For instance, convolutional neural networks were used for sonar image recognition in (Valdenegro-Toro, 2017, 2019), and the classification of AUV trajectories was studied in (Alvarez, Hastie, & Lane, 2017).

In (Nascimento & Valdenegro-Toro, 2018), recurrent neural networks (RNNs) were used to model thruster faults. Yet model learning for underwater vehicles is still understudied in this field. In (Van De Ven, Johansen, Sørensen, Flanagan, & Toal, 2007), a neural network was used to identify only the damping term of the model; the network was trained on a simulation of an AUV, and no real sensory data was considered. In (Xu, An, Qiao, Zhu, & Li, 2013), least squares support vector regression (LS-SVR) was used to identify the Coriolis and centripetal acceleration term combined with the damping terms of a model underwater vehicle, using a dataset from a towing tank experiment. The model was validated only against a simulated model. In (Fagogenis, Flynn, & Lane, 2014), locally weighted projection regression was used to compensate for the mismatch between the physics-based model and the actual navigation data of the AUV Nessie. An auto-regressive network combined with a genetic algorithm was used to identify the model of a simulated AUV with variable mass in (Shafiei & Binazadeh, 2015). A symbolic regression method based on genetic programming was used in (Wu, Wang, Ge, Wu, & Yang, 2017) to find the best-fitting model structure of a simulated AUV. This study, however, tries to automatically construct a parametric model from simple mathematical components through genetic programming.


In DeeperSense, we will exploit the power of data-driven machine learning to improve robot environment perception. State-of-the-art Deep Learning algorithms will be selected and customized / trained for underwater environment perception, combining visual and non-visual sensors. The trained algorithms will be optimized to run in real-time on board an underwater vehicle. The application of Deep Learning and real-time execution in this area goes beyond the current SoA and represents a significant step towards a wider use of AI and data-driven ML methods in robotics.