Show simple item record

dc.contributor.advisor	Langseth, Helge
dc.contributor.advisor	Tidemann, Axel
dc.contributor.advisor	Banino, Cyril
dc.contributor.author	Sve, Thomas
dc.contributor.author	Remmen, Bjørnar Moe
dc.date.accessioned	2017-10-12T14:00:24Z
dc.date.available	2017-10-12T14:00:24Z
dc.date.created	2017-06-08
dc.date.issued	2017
dc.identifier	ntnudaim:15964
dc.identifier.uri	http://hdl.handle.net/11250/2459946
dc.description.abstract	Visual recognition systems are often limited to the object categories they were trained on and therefore scale poorly, in part because acquiring sufficient labeled images becomes difficult as the number of object categories grows. To address this, earlier research has presented models that use other sources, such as text data, to help classify object categories unseen during training. However, most of these models are limited to images with a single label, even though most images contain more than one object category, and therefore more than one label. This master's thesis implements a model capable of classifying unseen categories for both single- and multi-labeled images. The architecture consists of several modules: a pre-trained neural network that generates image features for each image, a model trained on text that represents words as vectors, and a neural network that projects the image features into the space of the word vectors. On this architecture, we compared two approaches to generating word vectors, GloVe and Word2vec, with different vector dimensions and with spaces containing different numbers of word vectors. The model was adapted to multi-label prediction by comparing three approaches to generating image boxes: YOLOv2, Faster R-CNN, and randomly generated boxes. Each box represents a cropped section of the image, and this approach was chosen to fit each label to one of these boxes. The results showed that increasing the word vector dimension increased the accuracy, with Word2vec outperforming GloVe, while adding more words to the word vector space decreased the accuracy. In the single-label scenario, the model achieves results similar to those of existing models with similar architectures. In the multi-label scenario, the model trained on boxes generated by Faster R-CNN and evaluated on randomly generated boxes had the highest accuracy, but was not able to outperform comparative alternatives. The architecture gives promising results, but more investigation is needed to determine whether the results can be improved further.
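To make the projection step in the abstract concrete, the following is a minimal sketch (not the thesis code) of the core zero-shot idea: a learned linear map projects CNN image features into the word-vector space, and a label is assigned by picking the most similar word vector, which may belong to a category never seen during training. All names, dimensions, and the random stand-in projection matrix are illustrative assumptions, not the thesis implementation.

    # Minimal zero-shot classification sketch: project image features into the
    # word-vector space and classify by nearest word vector (cosine similarity).
    # Dimensions, labels, and the projection matrix are illustrative placeholders.
    import numpy as np

    rng = np.random.default_rng(0)

    feat_dim, embed_dim = 2048, 300              # e.g. CNN feature size, word-vector size
    W = rng.normal(size=(embed_dim, feat_dim))   # learned projection (random stand-in here)

    # Word vectors for candidate categories, including an "unseen" one.
    label_vectors = {
        "dog": rng.normal(size=embed_dim),
        "cat": rng.normal(size=embed_dim),
        "zebra": rng.normal(size=embed_dim),     # category absent from training images
    }

    def classify(image_features):
        """Project image features and return the label whose word vector is closest."""
        projected = W @ image_features
        projected /= np.linalg.norm(projected)
        scores = {
            label: float(vec @ projected / np.linalg.norm(vec))
            for label, vec in label_vectors.items()
        }
        return max(scores, key=scores.get)

    print(classify(rng.normal(size=feat_dim)))

In the multi-label setting described above, the same classification would be applied per image box (from YOLOv2, Faster R-CNN, or random boxes) rather than to the whole image.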
dc.language	eng
dc.publisher	NTNU
dc.subject	Informatikk, Kunstig intelligens
dc.title	Investigating Zero-Shot Learning techniques in multi-label scenarios
dc.type	Master thesis


Associated file(s)


This item appears in the following collection(s)
