
Combining Image and Depth Data for Efficient Semantic Segmentation

Junge, Lars Erik
Master thesis
View/Open
18669_FULLTEXT.pdf (17.31Mb)
18669_COVER.pdf (1.556Mb)
URI
http://hdl.handle.net/11250/2561058
Date
2018
Collections
  • Institutt for teknisk kybernetikk [4085]
Abstract
Unmanned ground vehicles (UGVs) and other autonomous systems rely on sensors to understand their environments. These systems are often equipped with cameras that capture detailed views of the surrounding scenes. Computer vision systems, which extract meaningful information from digital images, are frequently used to provide scene understanding for such platforms. Modern deep learning techniques, and in particular convolutional neural networks (CNNs), have been successfully applied to analyzing and understanding images. CNN models are state of the art for computer vision tasks such as image classification and semantic segmentation. Semantic segmentation involves classifying each pixel in an image as belonging to one of a set of classes. Modern CNN models have achieved impressive results on popular semantic segmentation benchmarks when working with RGB image inputs. Incorporating data from other sensor modalities has the potential to improve perception capability further.
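
To make the task concrete, the per-pixel classification in semantic segmentation can be viewed as taking, for every pixel, the class with the highest predicted score. The sketch below is a minimal illustration and is not taken from the thesis; the class count and image size are assumptions.

```python
import numpy as np

# Hypothetical per-pixel class scores from a segmentation network
# (class count and image size are illustrative, not from the thesis):
# shape (num_classes, height, width).
num_classes, height, width = 19, 512, 1024
logits = np.random.randn(num_classes, height, width)

# Semantic segmentation assigns every pixel the class with the highest
# score, producing a label map of shape (height, width).
label_map = logits.argmax(axis=0)
print(label_map.shape)  # (512, 1024)
```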

This thesis studies how ENet, a CNN model for real-time semantic segmentation from RGB inputs, can be extended to incorporate depth information. The network is modified by adding a feature extraction branch that learns features from depth images. The depth features are fused into the feature maps of the RGB feature extraction branch at several points. The fusion is implemented as element-wise summation layers placed throughout the encoder part of the network. Two new variants of the architecture are proposed, fusing features at one and at three points, respectively. The performance of both models is compared with that of the baseline ENet model, which operates on RGB inputs only. We use two datasets to assess the performance of the different models. First, we benchmark on the popular Cityscapes dataset to evaluate performance in urban scenes. A smaller dataset of forested scenes, the Freiburg Forest dataset, is used to assess the potential in more challenging off-road environments. The models are evaluated in terms of segmentation quality, using common metrics, as well as in terms of efficiency.
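
As a rough sketch of the fusion scheme described above, an encoder with a parallel depth branch and element-wise summation at several stages might look like the following. This is not the actual extended ENet architecture; the layer widths, number of stages, and module names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RGBDFusionEncoder(nn.Module):
    """Simplified sketch of an encoder that fuses depth features into RGB
    feature maps by element-wise summation at several stages (not the
    actual ENet architecture; layer widths and depths are illustrative)."""

    def __init__(self):
        super().__init__()
        # Parallel feature extraction branches for RGB (3 channels) and depth (1 channel).
        self.rgb_stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 64, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
        ])
        self.depth_stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 64, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, rgb, depth):
        for rgb_stage, depth_stage in zip(self.rgb_stages, self.depth_stages):
            rgb = rgb_stage(rgb)
            depth = depth_stage(depth)
            # Fusion point: element-wise summation of the depth features
            # into the RGB feature maps at this encoder stage.
            rgb = rgb + depth
        return rgb

encoder = RGBDFusionEncoder()
rgb_image = torch.randn(1, 3, 256, 512)    # RGB input
depth_image = torch.randn(1, 1, 256, 512)  # depth input
fused_features = encoder(rgb_image, depth_image)
print(fused_features.shape)  # torch.Size([1, 128, 32, 64])
```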

Fusing depth and RGB features at a single location gave poorer segmentation quality than the baseline model, which operates only on RGB inputs. Performance was improved by fusing depth and RGB features at several locations. The improvement is most noticeable for the segmentation of spatially small classes, with which the baseline model struggles. The most conclusive results stem from the experiments performed on the Cityscapes dataset. The extended ENet model with improved results has significantly slower inference speed; however, the model remains small in size compared to other models.
Publisher
NTNU
