> An alternative is that, by not knowing what you are doing, you may not see all the options that exist -- and when you hit a problem too hard, you just throw more hardware (GPUs) at it.
Maybe some will. I just explained that I'm running my models on CPUs, so I'm actually developing sparse, efficient, resource-constrained models that evaluate quickly.
I've been working with libtorch's JIT engine in Rust (via the tch-rs bindings).
I'm currently trying to adapt MelGAN to the voice conversion problem domain so I can get real-time, high-fidelity VC without using a classical vocoder. WORLD works well and is fast, but it's a poor substitute for the real thing, as it only models the fundamental frequency, spectral envelope, and aperiodicity. MelGAN is super high quality and fast.
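For anyone curious what the libtorch JIT workflow looks like from Rust, here's a minimal sketch using the `tch` crate. The model filename and tensor shapes are illustrative assumptions (a MelGAN-style generator exported from Python with `torch.jit.trace`), not my actual setup:

```rust
// Sketch: running a TorchScript-exported neural vocoder on CPU via tch-rs.
// Assumes libtorch is installed and "melgan_generator.pt" was exported
// beforehand with torch.jit.trace/script in Python.
use tch::{CModule, Device, Kind, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the traced/scripted module onto the CPU.
    let model = CModule::load_on_device("melgan_generator.pt", Device::Cpu)?;

    // Dummy input: batch of 1 mel spectrogram, 80 mel bins x 100 frames
    // (shapes are assumptions; match whatever the exported model expects).
    let mel = Tensor::randn(&[1, 80, 100], (Kind::Float, Device::Cpu));

    // Forward pass: the generator upsamples the mel frames to a raw waveform.
    let audio = model.forward_ts(&[mel])?;
    println!("output waveform shape: {:?}", audio.size());
    Ok(())
}
```

Since the whole graph is serialized into the `.pt` file, there's no Python dependency at inference time, which is what makes this practical for a real-time CPU deployment.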
Are you working on VC (input: speech of one speaker, output: the same spoken content, but sounds like another speaker) or speaker-adaptive speech synthesis (input: text, output: speech)?
Also check out ParallelWaveGAN, another high-quality and very fast (on CPU) neural vocoder.