
Obviously I don't mean literally zero compute. The amount of compute needed scales with the number of parameters, but I have yet to use a model that has so many parameters that token generation becomes compute bound. (Up to 104B for dense models.) During token generation most of the time is spent idle waiting for weights to transfer from memory. The processor is bored out of its mind waiting for more data. Memory bandwidth is the bottleneck.
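To put a number on this, a back-of-the-envelope sketch: at batch size 1, every generated token requires streaming the full set of weights from memory once, so memory bandwidth sets a hard ceiling on decode speed. The model size, quantization, and bandwidth figures below are hypothetical, chosen only to illustrate the calculation:

```python
# Rough upper bound on decode speed when memory bandwidth is the
# bottleneck. Assumption: each token streams all weights once
# (batch size 1, dense model, no speculative decoding).

def tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-limited ceiling on tokens/sec."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Hypothetical: a 104B dense model quantized to ~4 bits/param
# (0.5 bytes) on a machine with ~400 GB/s of memory bandwidth.
print(round(tokens_per_second(104, 0.5, 400), 1))
```

Whatever the exact numbers, the point is that the ceiling is bandwidth divided by model bytes; a faster processor doesn't raise it.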


It sounds like you aren’t batching efficiently if you are being bound by memory bandwidth.


That’s right: in the context of Apple silicon and Strix Halo, these use cases don’t involve much batching.
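The reason batching changes the picture is that one forward pass reads the weights once regardless of batch size, while compute scales with the number of sequences, so larger batches raise arithmetic intensity until the processor, not memory, is the limit. A minimal sketch, with hypothetical hardware and model numbers:

```python
# Sketch: why batching shifts decode from memory-bound to compute-bound.
# Assumption: weights are read from memory once per forward pass,
# while FLOPs grow linearly with batch size.

def bottleneck(batch, model_bytes, flops_per_token, bandwidth, peak_flops):
    mem_time = model_bytes / bandwidth                 # weight streaming
    compute_time = batch * flops_per_token / peak_flops
    return "compute-bound" if compute_time > mem_time else "memory-bound"

# Hypothetical machine: 400 GB/s bandwidth, 50 TFLOP/s peak.
MODEL_BYTES = 52e9        # 104B params at ~4 bits/param
FLOPS_PER_TOKEN = 2 * 104e9   # ~2 FLOPs per parameter per token

print(bottleneck(1, MODEL_BYTES, FLOPS_PER_TOKEN, 400e9, 50e12))
print(bottleneck(64, MODEL_BYTES, FLOPS_PER_TOKEN, 400e9, 50e12))
```

At batch size 1 the weight-streaming time dominates; by batch 64 (in this made-up configuration) the compute time overtakes it, which is why single-user local inference stays bandwidth-bound while batched server inference does not.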



