Spatial Bias in Vision-Based Voice Activity Detection

Stefanov, Kalin; Adiban, Mohammad; Salvi, Giampiero

dc.contributor.author	Stefanov, Kalin
dc.contributor.author	Adiban, Mohammad
dc.contributor.author	Salvi, Giampiero
dc.date.accessioned	2021-09-28T06:27:02Z
dc.date.available	2021-09-28T06:27:02Z
dc.date.created	2021-03-03T11:26:29Z
dc.date.issued	2021
dc.identifier.issn	1051-4651
dc.identifier.uri	https://hdl.handle.net/11250/2783882
dc.description.abstract	We develop and evaluate models for automatic vision-based voice activity detection (VAD) in multiparty human-human interactions that are aimed at complementing acoustic VAD methods. We provide evidence that this type of vision-based VAD model is susceptible to spatial bias in the dataset used for its development; the physical setting of the interaction, usually constant throughout data acquisition, determines the distribution of head poses of the participants. Our results show that when the head pose distributions are significantly different in the train and test sets, the performance of the vision-based VAD models drops significantly. This suggests that previously reported results on datasets with a fixed physical configuration may overestimate the generalization capabilities of this type of model. We also propose a number of possible remedies to the spatial bias, including data augmentation, input masking and dynamic features, and provide an in-depth analysis of the visual cues used by the developed vision-based VAD models.	en_US
dc.language.iso	eng	en_US
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)	en_US
dc.title	Spatial Bias in Vision-Based Voice Activity Detection	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	publishedVersion	en_US
dc.rights.holder	© IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	en_US
dc.source.journal	IEEE Computer Society	en_US
dc.identifier.doi	10.1109/ICPR48806.2021.9413345
dc.identifier.cristin	1895227
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	0

Associated file(s)

Filename: 2900.pdf
Size: 3.087Mb
Format: PDF
Description: Stefanov

This item appears in the following collection(s)

Department of Electronic Systems (Institutt for elektroniske systemer) [2289]
Publications from CRIStin - NTNU [37237]