I'd mentally change the acronym to Speech to Tokens. Parsing emotion and other n...

famouswaffles · on May 14, 2024

It could be added. Still wouldn't sound as good as what we have here. Audio is Audio and Text is Text and no amount of metadata we can practically provide will replace the information present in sound.

You can't exactly metadata your way out of this (skip to 11:50)

https://www.youtube.com/live/DQacCB9tDaw?si=yN7al6N3C7vCemhL

somenameforme · on May 14, 2024

I'm not sure why you say so? To me that seems obviously literally just swapping/weighting between a set of predefined voices. I'm sure you've played a game with a face generator - it's the exact same thing, except with audio. I'd also observe in the demo that they explicitly avoided anything particularly creative, instead sticking within an extremely narrow domain very basic adjectives: neutral, dramatic, singing, robotic, etc. I'm sure it also has happy, sad, angry, mad, and so on available.

But if the system can create a flamboyantly homosexual Captain Picard with a lisp and slight stutter engaging in overt innuendo when stating, "Number one, Engage!" then I look forward to eating crow! But as the instructions were all conspicuously just "swap to pretrained voice [x,y,z]", I suspect crow will not be on the menu any time soon.

defrost · on May 14, 2024

At least the training data exists: https://www.logotv.com/news/oypn0b/ian-mckellen-patrick-stew...

IanCal · on May 14, 2024

What about the input of the heavy breathing?

famouswaffles · on May 14, 2024

I'm sorry but you don't know what you're talking about and I'm done here. Clearly you've never worked with or tried to train STT or TTS models in any real capacity so inventing dramatic capabilities, disregarding latency and data requirements must come easily for you.

Open AI have explicitly made this clear. You are wrong. There's nothing else left to say here.