
Obviously I don't mean literally zero compute. The amount of compute needed scales with the number of parameters, but I have yet to use a model that has so many parameters that token generation becomes compute bound. (Up to 104B for dense models.) During token generation most of the time is spent idle waiting for weights to transfer from memory. The processor is bored out of its mind waiting for more data. Memory bandwidth is the bottleneck.
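To put a number on this, a back-of-the-envelope sketch: at batch size 1, every generated token requires streaming the full set of weights from memory once, so memory bandwidth sets a hard ceiling on decode speed. The model size, quantization, and bandwidth figures below are hypothetical, chosen only to illustrate the calculation:

```python
# Rough upper bound on decode speed when memory bandwidth is the
# bottleneck. Assumption: each token streams all weights once
# (batch size 1, dense model, no speculative decoding).

def tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-limited ceiling on tokens/sec."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Hypothetical: a 104B dense model quantized to ~4 bits/param
# (0.5 bytes) on a machine with ~400 GB/s of memory bandwidth.
print(round(tokens_per_second(104, 0.5, 400), 1))
```

Whatever the exact numbers, the point is that the ceiling is bandwidth divided by model bytes; a faster processor doesn't raise it.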


It sounds like you aren’t batching efficiently if you are being bound by memory bandwidth.


That’s right: in the context of Apple silicon and Strix Halo, these use cases don’t involve much batching.
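The reason batching changes the picture is that one forward pass reads the weights once regardless of batch size, while compute scales with the number of sequences, so larger batches raise arithmetic intensity until the processor, not memory, is the limit. A minimal sketch, with hypothetical hardware and model numbers:

```python
# Sketch: why batching shifts decode from memory-bound to compute-bound.
# Assumption: weights are read from memory once per forward pass,
# while FLOPs grow linearly with batch size.

def bottleneck(batch, model_bytes, flops_per_token, bandwidth, peak_flops):
    mem_time = model_bytes / bandwidth                 # weight streaming
    compute_time = batch * flops_per_token / peak_flops
    return "compute-bound" if compute_time > mem_time else "memory-bound"

# Hypothetical machine: 400 GB/s bandwidth, 50 TFLOP/s peak.
MODEL_BYTES = 52e9        # 104B params at ~4 bits/param
FLOPS_PER_TOKEN = 2 * 104e9   # ~2 FLOPs per parameter per token

print(bottleneck(1, MODEL_BYTES, FLOPS_PER_TOKEN, 400e9, 50e12))
print(bottleneck(64, MODEL_BYTES, FLOPS_PER_TOKEN, 400e9, 50e12))
```

At batch size 1 the weight-streaming time dominates; by batch 64 (in this made-up configuration) the compute time overtakes it, which is why single-user local inference stays bandwidth-bound while batched server inference does not.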



