I played around with flite a decade ago, and even then it felt nowhere close to the fidelity of other speech synthesis systems. I find it surprising that there still isn't anything better than Festival/flite. It sounded like a clunky robot then, and it still does today. Surely some of the many research projects have released their work as open source?
I thought I'd just add another fun fact/data point here. This is obviously my personal opinion. I have to use TTS to use my computer with a screen reader, and for that, I mostly prefer more synthetic speech. When I read long-form text like books, articles, etc., I do prefer more natural voices, but for doing actual work like reading code or simply using user interfaces, I like the predictability of more synthetic/algorithmic speech. Apple added the neural Siri voices to the new VoiceOver. They sound incredible, but the quality of the voice also brings latency with it. Something like eSpeak is much, much more performant and predictable, and it speeds up much better.

I use my TTS at a very fast rate, and I find that the more natural a voice, the harder it is to understand at that speech rate. Neural voices speak the same phrase of text differently every time it's uttered: slightly different intonation, slightly different speech rhythm. This makes it hard to listen out for patterns. So for me there's definitely still a place for synthetic speech.
In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years. (It was ported to Android several years ago, but that port was discontinued because they couldn't update it for 64-bit.) No recent speech synthesizer has quite matched its consistent intelligibility, particularly at high speeds. espeak-ng comes close, but it has a bad reputation (mostly, I think, leftover from earlier versions of espeak that really weren't very good).
Edit 2: To elaborate on what I mean by "mostly dead": In 2009 I was tasked with adding support for ETI-Eloquence to a Windows screen reader I developed. At that time, Nuance was still selling Eloquence to companies like the one I worked for back then. When I got the SDK, the timestamps on the files, particularly the main DLLs, were from 2002. As far as I know, an updated SDK for Windows was never released. I'm thankful for Windows's legendary emphasis on backward compatibility, particularly compared to Apple platforms and even Android.
Finally, a sample of espeak-ng (in the NVDA screen reader) at my preferred speed: https://mwcampbell.us/audio/espeak-ng-sample-2021-09-25.mp3 I use the default British pronunciation even though I'm American, because the American pronunciation is noticeably off.
> In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years.
This is exactly the speech synthesizer I use daily. I've gotten so used to it over the years that switching away from it is painful.
On Apple platforms, though, using it is not an option, so I use Karen. I used to use Alex, but Karen appears to be slightly more responsive and tries to do less human stuff when reading. Responsiveness is a very important factor, actually, probably more so than people might realize. Eloquence and eSpeak react pretty much instantly, whereas other voices might take 100 ms or so. This is a very big deal for me. Just like how one would like instant visual feedback on their screen, it's the same for me with speech. The less latency, the better.
My problem with eSpeak is that it sounds very rough and metallic, whereas Eloquence has a much warmer sound to it. I pitch mine down slightly to get an even warmer sound. Being pleasant on the ears is super important if you listen to the thing many, many hours a day.
I agree with you that Eloquence sounds warmer than eSpeak. I wish there was an open-source speech synthesizer comparable to Eloquence or even DECtalk. That approach to speech synthesis is old enough now that I'm sure there are published algorithms whose patents have expired. The problem, of course, would be funding the work on a good open-source implementation.
The HTS voice from NIT recommended in the article (voice_cmu_us_slt_arctic_hts) actually sounds much better than the clunky robot from a decade ago. Hear it here:
It does indeed sound much better, yes. But that voice was already there a decade ago. It's not... hm. Let me just say that I don't wish to disparage the work done on those projects, as I do think it is great. Maybe this video better illustrates my point: it showcases the project I mentioned, and machine learning techniques have progressed immensely over the last decade: https://www.youtube.com/watch?v=-O_hYhToKoA
There are of course great benefits to something simple to use. I remember cross-compiling flite for a custom Android/Windows/Linux project based on SDL, to generate voice lines for an in-game robot companion (nothing came of it, though). It probably would not be nearly as feasible to do the same with some dependency-heavy machine learning library.
Now, I haven't done any research to find better examples of projects. I was just surprised at how closely the options the article describes match what was available 12 years ago.
Yes, I think that consumer-level state-of-the-art speech synthesis is still pretty far from acceptable. Amazon Polly doesn't sound too great, and presumably Amazon has more than enough big data to leverage and cloud computing to work with.
Some of Amazon's voices sound amazing to me. I've actually tested a few of them, and people couldn't tell they're synthetic. Watson's voices are nice too.
(AFAIK Amazon bought the Polish company IVONA, long regarded as one of the best, for its Polly TTS system.)
I don't understand this approach when audio deepfakes exist that can quite realistically make Ayn Rand read arbitrary texts [1]. Is it simply a matter of processing power?
Work like, say https://arxiv.org/abs/1806.04558 [paper]
https://github.com/CorentinJ/Real-Time-Voice-Cloning [repo]