
Combining Image and Depth Data for Efficient Semantic Segmentation

Junge, Lars Erik
Master thesis
View/Open
18669_FULLTEXT.pdf (17.31Mb)
18669_COVER.pdf (1.556Mb)
URI
http://hdl.handle.net/11250/2561058
Date
2018
Collections
  • Institutt for teknisk kybernetikk [4085]
Abstract
Unmanned ground vehicles (UGVs) and other autonomous systems rely on sensors to understand their environments. These systems are often equipped with cameras that capture detailed views of the surrounding scenes. Computer vision systems, which extract meaningful information from digital images, are frequently used to provide scene understanding for such platforms. Modern deep learning techniques, and in particular convolutional neural networks (CNNs), have been successfully applied to analyzing and understanding images. CNN models are state of the art for computer vision tasks such as image classification and semantic segmentation. Semantic segmentation involves classifying each pixel in an image as belonging to one of a set of classes. Modern CNN models have achieved impressive results on popular semantic segmentation benchmarks when working with RGB image inputs. Incorporating data from other sensor modalities has the potential to improve perception capability further.
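
To make the task concrete, the per-pixel classification in semantic segmentation can be viewed as taking, for every pixel, the class with the highest predicted score. The sketch below is a minimal illustration and is not taken from the thesis; the class count and image size are assumptions.

```python
import numpy as np

# Hypothetical per-pixel class scores from a segmentation network
# (class count and image size are illustrative, not from the thesis):
# shape (num_classes, height, width).
num_classes, height, width = 19, 512, 1024
logits = np.random.randn(num_classes, height, width)

# Semantic segmentation assigns every pixel the class with the highest
# score, producing a label map of shape (height, width).
label_map = logits.argmax(axis=0)
print(label_map.shape)  # (512, 1024)
```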

This thesis studies how ENet, a CNN model for real-time semantic segmentation from RGB inputs, can be extended to incorporate depth information. The network is modified by adding a feature extraction branch that learns features from depth images. The depth features are fused into the feature maps of the RGB feature extraction branch at several points. The fusion is implemented as element-wise summation layers placed throughout the encoder part of the network. Two new variants of the architecture are proposed, fusing features at one and at three points, respectively. The performance of both models is compared with that of the baseline ENet model, which operates on RGB inputs only. We use two datasets to assess the performance of the different models. First, we benchmark on the popular Cityscapes dataset to evaluate performance in urban scenes. A smaller dataset of forested scenes, the Freiburg Forest dataset, is used to assess the potential in more challenging off-road environments. The models are evaluated in terms of segmentation quality, using common metrics, as well as in terms of efficiency.
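
As a rough sketch of the fusion scheme described above, an encoder with a parallel depth branch and element-wise summation at several stages might look like the following. This is not the actual extended ENet architecture; the layer widths, number of stages, and module names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RGBDFusionEncoder(nn.Module):
    """Simplified sketch of an encoder that fuses depth features into RGB
    feature maps by element-wise summation at several stages (not the
    actual ENet architecture; layer widths and depths are illustrative)."""

    def __init__(self):
        super().__init__()
        # Parallel feature extraction branches for RGB (3 channels) and depth (1 channel).
        self.rgb_stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 64, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
        ])
        self.depth_stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 64, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, rgb, depth):
        for rgb_stage, depth_stage in zip(self.rgb_stages, self.depth_stages):
            rgb = rgb_stage(rgb)
            depth = depth_stage(depth)
            # Fusion point: element-wise summation of the depth features
            # into the RGB feature maps at this encoder stage.
            rgb = rgb + depth
        return rgb

encoder = RGBDFusionEncoder()
rgb_image = torch.randn(1, 3, 256, 512)    # RGB input
depth_image = torch.randn(1, 1, 256, 512)  # depth input
fused_features = encoder(rgb_image, depth_image)
print(fused_features.shape)  # torch.Size([1, 128, 32, 64])
```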

Fusing depth and RGB features at a single location gave poorer segmentation quality than the baseline model, which operates only on RGB inputs. Performance was improved by fusing depth and RGB features at several locations. The improvement is most noticeable for the segmentation of spatially small classes, with which the baseline model struggles. The most conclusive results stem from the experiments performed on the Cityscapes dataset. The extended ENet model with improved results has significantly slower inference speed; however, the model remains small in size compared to other models.
Publisher
NTNU
