WET: Word embedding-topic distribution vectors for MOOC video lectures dataset

Kastrati, Zenun; Kurti, Arianit; Imran, Ali Shariq

Journal article, Peer reviewed

Published version

Open

Kastrati.pdf (481.9Kb)

Permanent link

https://hdl.handle.net/11250/3033986

Publication date

2020

Abstract

In this article, we present a dataset containing word embeddings and document topic distribution vectors generated from MOOC video lecture transcripts. Transcripts of 12,032 video lectures from 200 courses were collected from the Coursera learning platform. This large corpus of transcripts was used as input to two well-known NLP techniques, Word2Vec and Latent Dirichlet Allocation (LDA), to generate word embeddings and topic vectors, respectively. We used the Word2Vec and LDA implementations in the Gensim package for Python. The data presented in this article are related to the research article entitled “Integrating word embeddings and document topics with deep learning in a video classification framework” [1]. The dataset is hosted in the Mendeley Data repository.

Publisher

Elsevier

Journal

Data in Brief

Unless otherwise stated, this entry is licensed under Attribution 4.0 International