The new Google Speech solution is the perfect example of why Google had to do its own silicon.
Doing speech at 16k samples a second through a neural network while keeping the cost reasonable is really, really difficult.
The old way was far more power efficient. If you are going to use this new technique, which gets you a far better result, and do it at a reasonable cost, you have to go all the way down into the silicon.
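To put a rough number on that claim, here is a back-of-envelope sketch assuming a WaveNet-style model that runs a full forward pass per audio sample. The layer count and channel width are made-up illustrative values, not Google's actual model sizes:

```python
# Back-of-envelope cost of sample-by-sample neural speech synthesis.
# All model sizes below are illustrative assumptions, not Google's numbers.

SAMPLE_RATE = 16_000  # audio samples per second

# Assume a WaveNet-style stack: 30 dilated conv layers, 256 channels,
# kernel size 2; each layer does kernel * channels * channels multiply-adds
# per sample, and each multiply-add counts as 2 FLOPs.
layers, channels, kernel = 30, 256, 2
flops_per_sample = layers * kernel * channels * channels * 2

flops_per_second_of_audio = flops_per_sample * SAMPLE_RATE
print(f"{flops_per_second_of_audio / 1e9:.1f} GFLOPs per second of audio")
```

Under these assumed sizes that is roughly 126 GFLOPs for every second of audio, before counting the memory traffic each per-sample pass incurs — which is why the old frame-based pipelines were so much cheaper to run.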
Can't wait to see the cost difference using Google TPUs and this technique versus traditional approaches.
Plus this approach inherently supports multiple cores. How would you ever do a tree search across multiple cores?
Ultimately to get the new applications we need Google and others doing the silicon. We are getting to extremes where the entire stack has to be tuned together.
I think Google's vision for Lens is going to be a similar situation.
This somewhat blows my mind. Yes, it is impressive. However, the work that Nuance and similar companies used to do is still competitive, just not getting nearly the money and exposure.
I remember that over a decade ago they even had mood analysis they could apply while listening to people. Far from new. Is it truly more effective or efficient nowadays? Or is it just getting marketed by companies you've heard of?
"Nuance and similar companies used to do are still competitive"
Surprised. Curious if you can compare the inferences per joule against Google's 1st-gen TPUs? Google shared a paper, the numbers are pretty impressive, and I was not aware of anyone else close to the gen-1 TPUs.
Here is the paper that you can use for the TPU side. I would love to see someone else in the ballpark; we really do not want just one company, but competition.
This seems to be comparing them on their own terms. I'm more curious about features. Dragon NaturallySpeaking and some other products have been really impressive for years now, far beyond what my phone is capable of.
Not to say that the likes of the Echo and others aren't impressive, just that the speech recognition is the least of those products. Fully transcribed voicemail was available for years with Google Voice (even before it was Google Voice), yet that seems to happen less now than it did when I first got the product.
Did the old methods use neural networks? I wouldn't be surprised if they did, but I would be surprised if they were networks as deep as what people use today.
That is, I am interested in comparing them on speed of transcription, speech synthesis, error rates, etc. Not on speed of network execution.
It is truly better. Objective metrics (such as word error rate) don't lie. You can argue whether it makes sense to use, say, 100x the compute to get half the error, but that's a different argument; I don't think anyone is really disputing the improved quality.
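For reference, word error rate is just word-level edit distance divided by the reference length, so it is easy to compute yourself. A minimal sketch:

```python
# Minimal word error rate (WER): Levenshtein distance over words,
# normalized by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over 6 words -> ~0.167
```

Published WER numbers also depend on text normalization (casing, punctuation, number spelling), so cross-vendor comparisons need matched scoring rules, not just this formula.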
Do you have a good comparison point? And not, hopefully, comparing to what they could do a decade ago. I'm assuming they didn't sit still. Did they?
I question whether it is just 100x the compute. It feels like more, since NaturallySpeaking and friends didn't hog the machine. Again, that was over a full decade ago.
What's more, the resources that Google has to throw at training are ridiculous. Well over 100x what was used to build the old models.
None of this is to say we should pack up and go back to a decade ago. I just worry that we do the opposite, where we ignore progress that was made a decade ago in favor of the new tricks alone.
The thing is, it is not simply the training; the inference side would have required an incredible amount of compute compared to the old way of doing it.
Hope Google will do a paper like they did with the gen-1 TPUs. Would love to see the difference in terms of joules per word spoken.
Here, listen to the results.
https://cloudplatform.googleblog.com/2018/03/introducing-Clo...
Now I am curious about the cost difference Google was able to achieve. It is still going to be more than the old way, but how close did Google come?
But my favorite new thing with these chips is the Jeff Dean paper.
https://www.arxiv-vanity.com/papers/1712.01208v1/
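The core idea in that paper is that a model can predict a key's position in a sorted array, replacing the upper levels of a B-tree-style index, with a small local search correcting the prediction error. A toy sketch of the idea, using a linear fit in place of the paper's learned models (the key set and the error window below are illustrative assumptions):

```python
# Toy "learned index": predict a key's position in a sorted array with a
# linear model, then correct with a bounded local search. A linear fit
# stands in here for the paper's learned models; it is an illustration,
# not the paper's actual architecture.
import bisect

keys = sorted(range(0, 10_000, 7))  # sorted keys with a roughly linear CDF
n = len(keys)

# "Train": fit position ~= slope * key + intercept on (key, index) pairs.
slope = (n - 1) / (keys[-1] - keys[0])
intercept = -slope * keys[0]

def lookup(key):
    guess = int(slope * key + intercept)          # model prediction
    guess = max(0, min(n - 1, guess))
    # Bounded local search around the prediction (the paper tracks the
    # model's worst-case error to size this window; +/-2 is assumed here).
    lo, hi = max(0, guess - 2), min(n, guess + 3)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < hi and i < n and keys[i] == key:
        return i
    # Fall back to a full binary search if the window missed.
    i = bisect.bisect_left(keys, key)
    return i if i < n and keys[i] == key else None
```

Because these keys grow linearly, the fit is near-exact and the window almost never misses; the paper's contribution is handling real, less regular key distributions with staged models.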