Combining Image and Depth Data for Efficient Semantic Segmentation
Master's thesis
Permanent link: http://hdl.handle.net/11250/2561058
Publication date: 2018
Abstract
Unmanned ground vehicles (UGVs) and other autonomous systems rely on sensors to understand their environments. These systems are often equipped with cameras which capture detailed descriptions of surrounding scenes. Computer vision systems, which extract meaningful information from digital images, are frequently used for scene understanding in such systems. Modern deep learning techniques, and in particular convolutional neural networks (CNNs), have been successfully applied to analyzing and understanding images. CNN models are state-of-the-art for solving computer vision tasks such as image classification and semantic segmentation. Semantic segmentation involves classifying each pixel in an image as belonging to one of a set of classes. Modern CNN models have achieved impressive results on popular semantic segmentation benchmarks when working with RGB image inputs. Incorporating data from other sensor modalities has the potential to improve perception capability further.
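As a concrete illustration of the per-pixel classification that semantic segmentation performs, the following sketch takes the argmax over class scores at each pixel. The logits, image size, and class count are made up for illustration and are not taken from the thesis:

```python
import numpy as np

# Hypothetical per-pixel class scores for a 2x2 image with 3 classes,
# shaped (height, width, num_classes). A real model would produce these.
logits = np.array([
    [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]],
    [[0.0, 0.1, 3.0], [0.9, 0.8, 0.7]],
])

# Semantic segmentation output: the most likely class at every pixel.
segmentation = logits.argmax(axis=-1)
print(segmentation)  # -> [[1 0]
                     #     [2 0]]
```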
This thesis studies how ENet, a CNN model for real-time semantic segmentation from RGB inputs, can be extended to incorporate depth information. The network is modified by adding a feature extraction branch, which learns features from depth images. The depth features are fused into the feature maps from the RGB feature extraction branch at several points. The fusion is implemented as element-wise summation layers, placed throughout the encoder part of the network. Two new variants of the architecture are proposed which fuse features one and three times. The performance of both models is compared with the baseline ENet model, which operates on only RGB inputs. We use two datasets to assess the performance of the different models. First, we benchmark on the popular Cityscapes dataset to evaluate performance in urban scenes. A smaller dataset from forested scenes, the Freiburg forest dataset, is used to assess potential in more challenging off-road environments. The models are evaluated in terms of segmentation quality, using common metrics, as well as in terms of efficiency.
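The fusion mechanism described above can be sketched in a few lines. The feature-map shapes and the NumPy stand-in for the actual CNN framework are assumptions (the abstract gives no implementation details); only the core idea, element-wise summation of depth features into the RGB branch at matching encoder stages, is from the text:

```python
import numpy as np

def fuse(rgb_feat, depth_feat):
    """Element-wise summation fusion of a depth-branch feature map
    into an RGB-branch feature map at one fusion point."""
    # Shapes must match for element-wise summation to be defined.
    assert rgb_feat.shape == depth_feat.shape
    return rgb_feat + depth_feat

rng = np.random.default_rng(0)

# Hypothetical encoder feature maps, shaped (channels, height, width).
# A three-fusion variant would apply `fuse` at three encoder stages.
rgb_feat = rng.standard_normal((16, 32, 64))
depth_feat = rng.standard_normal((16, 32, 64))

fused = fuse(rgb_feat, depth_feat)
```

Summation (rather than, say, channel concatenation) keeps the channel count unchanged, so the rest of the encoder needs no structural modification after each fusion point.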
Fusing depth and RGB features at one location gave poorer segmentation quality than the baseline model, which operates only on RGB inputs. Performance was improved by fusing depth and RGB features at several locations. The improvement is most noticeable for segmentation of spatially small classes, which the baseline model struggles with. The most conclusive results stem from the experiments performed on the Cityscapes dataset. While the extended ENet model with improved results has significantly slower inference speed, it remains small in size compared to other models.
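Per-class comparisons like the one above are typically made with intersection-over-union (IoU), the standard metric on Cityscapes; IoU is computed per class, so gains on spatially small classes show up directly. A minimal sketch with toy labels (the helper and values are illustrative, not from the thesis):

```python
import numpy as np

def class_iou(pred, target, cls):
    """Intersection-over-union for one class over label maps."""
    p = pred == cls
    t = target == cls
    union = np.logical_or(p, t).sum()
    # Undefined when the class is absent from both maps.
    if union == 0:
        return float("nan")
    return np.logical_and(p, t).sum() / union

# Toy 2x2 predicted and ground-truth label maps.
pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])

print(class_iou(pred, target, 1))  # -> 0.5 (1 overlap pixel, union of 2)
```

Mean IoU (mIoU), the usual summary score, averages `class_iou` over all classes present in the ground truth.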