
Investigating Zero-Shot Learning techniques in multi-label scenarios

Sve, Thomas; Remmen, Bjørnar Moe
Master thesis
View/Open
15964_FULLTEXT.pdf (4.424Mb)
15964_ATTACHMENT.zip (58.02Kb)
15964_COVER.pdf (1.556Mb)
URI
http://hdl.handle.net/11250/2459946
Date
2017
Collections
  • Institutt for datateknologi og informatikk [3872]
Abstract
Visual recognition systems are often limited to the object categories they were trained on and thus struggle to scale. This is in part due to the difficulty of acquiring sufficient labeled images as the number of object categories grows. To solve this, earlier research has presented models that use other sources, such as text data, to help classify object categories unseen during training. However, most of these models are limited to images with a single label, whereas most images contain more than one object category, and therefore more than one label. This master's thesis implements a model capable of classifying unseen categories for both single- and multi-labeled images.

The architecture consists of several modules: a pre-trained neural network that generates image features for each image, a model trained on text that represents words as vectors, and a neural network that projects the image features into the dimension native to the vector representation of words. On this architecture, we compared two approaches to generating word vectors, GloVe and Word2vec, with different vector dimensions and on spaces containing different numbers of word vectors. The model was adapted to multi-label prediction by comparing three approaches to image box generation: YOLOv2, Faster R-CNN, and randomly generated boxes. Each box represents a cropped section of the image, and this approach was chosen to fit each label to one of these boxes.
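To make the projection step concrete, the following is a minimal sketch (not the authors' code) of the zero-shot pipeline described above, assuming PyTorch, 2048-dimensional CNN image features, and 300-dimensional word vectors; all module and function names are illustrative. An image (or, in the multi-label case, an image box) is projected into the word-vector space, and a label is predicted by the nearest class word vector under cosine similarity, which is what allows categories unseen during training to be recognized as long as they have a word vector.

```python
# Illustrative sketch of the projection-based zero-shot pipeline.
# ProjectionNet, the dimensions, and predict_labels are assumptions,
# not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionNet(nn.Module):
    """Maps pre-trained CNN image features into the word-embedding space."""

    def __init__(self, feature_dim: int = 2048, embed_dim: int = 300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)


def predict_labels(image_features: torch.Tensor,
                   class_word_vectors: torch.Tensor,
                   model: ProjectionNet) -> torch.Tensor:
    """Return the index of the closest class word vector for each image/box.

    image_features:     (N, feature_dim) features from a pre-trained CNN,
                        one row per image or per image box.
    class_word_vectors: (C, embed_dim) GloVe/Word2vec vectors, one per class,
                        including classes unseen during training.
    """
    projected = F.normalize(model(image_features), dim=-1)  # (N, embed_dim)
    classes = F.normalize(class_word_vectors, dim=-1)       # (C, embed_dim)
    similarity = projected @ classes.t()                     # cosine similarity
    return similarity.argmax(dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ProjectionNet()
    feats = torch.randn(4, 2048)       # stand-in for CNN features of 4 boxes
    word_vecs = torch.randn(10, 300)   # stand-in for 10 class word vectors
    print(predict_labels(feats, word_vecs, model))
```

In the multi-label setting described above, the same prediction would simply be run once per generated box (YOLOv2, Faster R-CNN, or random), assigning each box its nearest label.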

The results showed that increasing the word vector dimension increased the accuracy, with Word2vec outperforming GloVe, while adding more words to the word vector space decreased the accuracy. In the single-label scenario, the model achieves results similar to existing models with a similar architecture. In the multi-label scenario, the model trained on boxes generated by Faster R-CNN and evaluated on randomly generated boxes had the highest accuracy, but was not able to outperform comparative alternatives. The architecture gives promising results, but more investigation is needed to determine whether the results can be improved further.
Publisher
NTNU
