dc.contributor.advisor | Martins, Thiago Guerrera | |
dc.contributor.author | Kopczynski, Piotr Ludvig | |
dc.date.accessioned | 2023-01-04T18:19:37Z | |
dc.date.available | 2023-01-04T18:19:37Z | |
dc.date.issued | 2022 | |
dc.identifier | no.ntnu:inspera:104646180:36450488 | |
dc.identifier.uri | https://hdl.handle.net/11250/3041056 | |
dc.description.abstract | In this project, deep learning models were trained on the Youtube-8M dataset, which is a large-scale benchmark for multi-label video classification, and evaluated using the F1-score. The trained models used different methods for representing a video based on its frames, and a comparison was made between them. The methods used in the project were Recurrent Neural Networks, Transformer-based networks, average pooling, and learnable pooling such as Deep Bag of Frames, Net Vectors of Locally Aggregated Descriptors, and Net Fisher Vectors. Experiments with hyperparameter tuning, network architecture, regularization, and the addition of a learnable non-linear unit called Context Gating were performed to improve the F1-score of the individual models. The results showed that among the sequential models, Recurrent Neural Networks were outperformed by Transformer-based models, which in turn were outperformed by every pooling model except Deep Bag of Frames; the model with the highest test F1-score was based on Net Vectors of Locally Aggregated Descriptors. | |
dc.description.abstract | In this project, deep learning models were trained on the Youtube-8M dataset, which is a large-scale benchmark for multi-label video classification, and evaluated using the F1-score metric. The trained models used different methods for representing a video based on its frames, and a comparison was made between them. The methods used in the project were Recurrent Neural Networks, Transformer-based networks, average pooling, and learnable pooling such as Deep Bag of Frames, Net Vectors of Locally Aggregated Descriptors, and Net Fisher Vectors. Experiments with hyperparameter tuning, network architecture, regularization, and adding a learnable non-linear unit called Context Gating were performed in order to improve the F1-score of the individual models. The results showed that for sequential models, Recurrent Neural Networks were outperformed by Transformer-based models, which in turn were outperformed by every pooling model except Deep Bag of Frames; the model with the highest test F1-score was based on Net Vectors of Locally Aggregated Descriptors. | |
dc.language | eng | |
dc.publisher | NTNU | |
dc.title | Video understanding with the Youtube-8M dataset | |
dc.type | Master thesis | |