Show simple item record

dc.contributor.advisor: Martins, Thiago Guerrera
dc.contributor.author: Kopczynski, Piotr Ludvig
dc.date.accessioned: 2023-01-04T18:19:37Z
dc.date.available: 2023-01-04T18:19:37Z
dc.date.issued: 2022
dc.identifier: no.ntnu:inspera:104646180:36450488
dc.identifier.uri: https://hdl.handle.net/11250/3041056
dc.description.abstract: In this project, deep learning models were trained on the Youtube-8M dataset, a large-scale benchmark for multi-label video classification, and evaluated using the F1-score metric. The trained models used different methods for representing a video based on its frames, and a comparison was made between them. The methods used in the project were Recurrent Neural Networks, Transformer-based networks, average pooling, and learnable pooling such as Deep Bag of Frames, Net Vectors of Locally Aggregated Descriptors, and Net Fisher Vectors. Experiments with hyperparameter tuning, network architecture, regularization, and the addition of a learnable non-linear unit called Context Gating were performed in order to improve the F1-score of the individual models. The results showed that among the sequential models, Recurrent Neural Networks were outperformed by Transformer-based models, which in turn were outperformed by every pooling model except Deep Bag of Frames; the model with the highest test F1-score was based on Net Vectors of Locally Aggregated Descriptors.
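The abstract evaluates all models with the F1-score on a multi-label task. As a minimal illustration of that metric, the sketch below computes a micro-averaged F1 over sets of predicted labels; the label names are hypothetical, and the thesis itself may use a per-example or differently averaged variant.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over a batch of multi-label predictions,
    where each element is a set of label strings."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not present
        fn += len(true - pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with made-up video labels (not from the dataset):
y_true = [{"sports", "soccer"}, {"music"}]
y_pred = [{"sports"}, {"music", "concert"}]
print(round(micro_f1(y_true, y_pred), 3))  # → 0.667
```

Micro-averaging pools true/false positives across all examples before computing precision and recall, so frequent labels dominate; a macro-averaged variant would instead average per-label F1 scores.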
dc.language: eng
dc.publisher: NTNU
dc.title: Video understanding with the Youtube-8M dataset
dc.type: Master thesis


Associated file(s)


This item appears in the following collection(s)
