Show simple item record

dc.contributor.author: Garg, Saurabh
dc.contributor.author: Ruan, Haoyao
dc.contributor.author: Hamarneh, Ghassan
dc.contributor.author: Behne, Dawn Marie
dc.contributor.author: Jongman, Allard
dc.contributor.author: Sereno, Joan
dc.contributor.author: Wang, Yue
dc.date.accessioned: 2024-01-19T08:40:38Z
dc.date.available: 2024-01-19T08:40:38Z
dc.date.created: 2023-06-09T14:52:45Z
dc.date.issued: 2023
dc.identifier.citation: International Journal of Speech Technology. 2023. [en_US]
dc.identifier.issn: 1381-2416
dc.identifier.uri: https://hdl.handle.net/11250/3112680
dc.description.abstract: Humans use both auditory and facial cues to perceive speech, especially when the auditory input is degraded, indicating a direct association between visual articulatory and acoustic speech information. This study investigates how well the audio signal of a word can be synthesized from visual speech cues. Specifically, we synthesized audio waveforms of the vowels in monosyllabic English words from motion trajectories extracted from image sequences in video recordings of the same words. The articulatory movements were recorded in two speech styles: plain and clear. We designed a deep network trained on mouth landmark motion trajectories with a spectrogram- and formant-based custom loss, trained separately for each speech style. Human and automatic evaluations show that our framework can generate identifiable audio of the target vowels from distinct mouth landmark movements using visual cues alone. Our results further demonstrate that intelligible audio can be synthesized for novel, unseen talkers independent of the training data. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Springer [en_US]
dc.title: Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation [en_US]
dc.title.alternative: Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation [en_US]
dc.type: Journal article [en_US]
dc.description.version: publishedVersion [en_US]
dc.rights.holder: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023 [en_US]
dc.source.pagenumber: 0 [en_US]
dc.source.journal: International Journal of Speech Technology [en_US]
dc.identifier.doi: 10.1007/s10772-023-10030-3
dc.identifier.cristin: 2153362
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 0
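The record does not specify the network's implementation, but as a rough illustration of the approach the abstract describes (a sequence model mapping mouth-landmark motion trajectories to spectrogram frames, trained with a spectrogram- and formant-based loss), the following is a minimal sketch in PyTorch. The GRU backbone, the landmark and mel-bin counts, and the formant-band weighting are all hypothetical assumptions, not the authors' published architecture.

    # Hypothetical sketch: landmark trajectories -> mel-spectrogram frames.
    # Layer sizes, the GRU backbone, and the formant-band loss weighting are
    # illustrative assumptions, not the method published in the paper.
    import torch
    import torch.nn as nn

    class Mouth2Spec(nn.Module):
        def __init__(self, n_landmarks=20, n_mels=80, hidden=256):
            super().__init__()
            # Each video frame holds (x, y) coordinates for every mouth landmark.
            self.rnn = nn.GRU(n_landmarks * 2, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
            self.proj = nn.Linear(hidden * 2, n_mels)

        def forward(self, landmarks):          # (batch, frames, n_landmarks * 2)
            h, _ = self.rnn(landmarks)
            return self.proj(h)                # (batch, frames, n_mels)

    def spectral_formant_loss(pred, target, formant_bins=slice(5, 40), alpha=0.5):
        # L1 spectrogram loss plus an extra penalty on the low-frequency bins
        # where vowel formants (F1/F2) dominate; the bin range and weight alpha
        # are illustrative assumptions.
        base = torch.mean(torch.abs(pred - target))
        formant = torch.mean(torch.abs(pred[..., formant_bins]
                                       - target[..., formant_bins]))
        return base + alpha * formant

    # Usage on random stand-in data:
    model = Mouth2Spec()
    x = torch.randn(4, 50, 40)   # 4 clips, 50 frames, 20 (x, y) landmarks each
    y = torch.randn(4, 50, 80)   # matching mel-spectrogram frames
    loss = spectral_formant_loss(model(x), y)
    loss.backward()

A vocoder (e.g. Griffin-Lim or a neural vocoder) would then be needed to turn the predicted spectrogram frames into the audio waveform the abstract refers to; that stage is omitted here.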

