dc.contributor.author | Garg, Saurabh | |
dc.contributor.author | Ruan, Haoyao | |
dc.contributor.author | Hamarneh, Ghassan | |
dc.contributor.author | Behne, Dawn Marie | |
dc.contributor.author | Jongman, Allard | |
dc.contributor.author | Sereno, Joan | |
dc.contributor.author | Wang, Yue | |
dc.date.accessioned | 2024-01-19T08:40:38Z | |
dc.date.available | 2024-01-19T08:40:38Z | |
dc.date.created | 2023-06-09T14:52:45Z | |
dc.date.issued | 2023 | |
dc.identifier.citation | International Journal of Speech Technology. 2023. | en_US |
dc.identifier.issn | 1381-2416 | |
dc.identifier.uri | https://hdl.handle.net/11250/3112680 | |
dc.description.abstract | Humans use both auditory and facial cues to perceive speech, especially when auditory input is degraded, indicating a direct association between visual articulatory and acoustic speech information. This study investigates how well the audio signal of a word can be synthesized from visual speech cues. Specifically, we synthesized audio waveforms of the vowels in monosyllabic English words from motion trajectories extracted from image sequences in video recordings of the same words. The articulatory movements were recorded in two different speech styles: plain and clear. We designed a deep network, trained separately for each speech style, that maps mouth landmark motion trajectories to audio using a custom spectrogram- and formant-based loss. Human and automatic evaluations show that our framework can generate identifiable audio of the target vowels from distinct mouth landmark movements using visual cues alone. Our results further demonstrate that intelligible audio can be synthesized for unseen talkers who were not part of the training data. | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Springer | en_US |
dc.title | Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation | en_US |
dc.title.alternative | Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation | en_US |
dc.type | Journal article | en_US |
dc.description.version | publishedVersion | en_US |
dc.rights.holder | © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023 | en_US |
dc.source.pagenumber | 0 | en_US |
dc.source.journal | International Journal of Speech Technology | en_US |
dc.identifier.doi | 10.1007/s10772-023-10030-3 | |
dc.identifier.cristin | 2153362 | |
cristin.ispublished | true | |
cristin.fulltext | original | |
cristin.qualitycode | 0 | |