Plain-to-clear speech video conversion for enhanced intelligibility

Sachdeva, Shubam; Ruan, Haoyao; Hamarneh, Ghassan; Behne, Dawn Marie; Jongman, Allard; Sereno, Joan; Wang, Yue

dc.contributor.author	Sachdeva, Shubam
dc.contributor.author	Ruan, Haoyao
dc.contributor.author	Hamarneh, Ghassan
dc.contributor.author	Behne, Dawn Marie
dc.contributor.author	Jongman, Allard
dc.contributor.author	Sereno, Joan
dc.contributor.author	Wang, Yue
dc.date.accessioned	2023-05-15T13:44:15Z
dc.date.available	2023-05-15T13:44:15Z
dc.date.created	2023-01-09T22:58:04Z
dc.date.issued	2023
dc.identifier.citation	International Journal of Speech Technology. 2023, 26, 163-184.	en_US
dc.identifier.issn	1381-2416
dc.identifier.uri	https://hdl.handle.net/11250/3067981
dc.description.abstract	Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels produced by multiple male and female talkers. Via a frame-by-frame, image-warping-based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI Lip Reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles and achieve enhanced intelligibility for AI; (2) this work suggests that universal, talker-independent clear-speech features may be used to modify any talker’s visual speech style; (3) we introduce the “displacement factor” as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the generated videos are high definition, which makes them ideal candidates for human-centric intelligibility and perceptual training studies.	en_US
dc.language.iso	eng	en_US
dc.publisher	Springer	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/deed.no	*
dc.title	Plain-to-clear speech video conversion for enhanced intelligibility	en_US
dc.title.alternative	Plain-to-clear speech video conversion for enhanced intelligibility	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	publishedVersion	en_US
dc.source.pagenumber	163-184	en_US
dc.source.volume	26	en_US
dc.source.journal	International Journal of Speech Technology	en_US
dc.identifier.doi	10.1007/s10772-023-10018-z
dc.identifier.cristin	2103692
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	1
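
The abstract describes the displacement factor as a parameter that scales the magnitude of the plain-to-clear modification. A minimal sketch of that scaling idea follows, assuming the visual features can be represented as matched 2D landmark points for one frame. The function name, the point representation, and the example coordinates are hypothetical and are not taken from the paper; the record does not describe the authors' actual feature extraction or image-warping steps, which are omitted here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def apply_displacement_factor(
    plain: List[Point], clear: List[Point], factor: float
) -> List[Point]:
    """Move each plain-speech landmark toward its clear-speech counterpart.

    Hypothetical helper: factor = 0 leaves the plain landmarks unchanged,
    factor = 1 reproduces the clear landmarks, and factor > 1 exaggerates
    the plain-to-clear displacement.
    """
    if len(plain) != len(clear):
        raise ValueError("plain and clear must contain the same number of points")
    return [
        (px + factor * (cx - px), py + factor * (cy - py))
        for (px, py), (cx, cy) in zip(plain, clear)
    ]


# Illustrative coordinates only (not data from the study).
plain_landmarks = [(10.0, 20.0), (30.0, 20.0)]
clear_landmarks = [(14.0, 28.0), (26.0, 28.0)]
halfway = apply_displacement_factor(plain_landmarks, clear_landmarks, 0.5)
```

In a full pipeline, the scaled landmark positions would serve as targets for warping each video frame; that step depends on the warping method and is not sketched here.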

Files in this item

Name: Sachdeva+%7E+Ruan+%7E+Hamarneh ...
Size: 2.173Mb
Format: PDF

This item appears in the following Collection(s)

Institutt for psykologi [2902]
Publikasjoner fra CRIStin - NTNU [37384]

Except where otherwise noted, this item's license is described as Attribution 4.0 International