
If you self-host, you likely won't have anywhere near enough volume to batch efficiently, so you'll end up bottlenecked on memory bandwidth rather than compute.

E.g. based on the calculations in https://www.tensoreconomics.com/p/llm-inference-economics-fr..., increasing batch size from 1 to 64 cuts the cost per token to 1/16th.
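The intuition can be sketched with a toy cost model (my own illustrative numbers, not taken from the linked article): during decoding, each step streams the full model weights from memory once, and that fixed cost is amortized across every sequence in the batch, while per-token compute stays constant. The `per_token_compute` value below is picked so the batch-1 to batch-64 ratio roughly matches the article's 16x figure.

```python
def cost_per_token(batch_size, weight_read_cost=1.0, per_token_compute=0.05):
    """Toy per-token cost: weight streaming is paid once per decoding
    step and shared across the batch; compute scales per token.
    All numbers are illustrative, not measured."""
    return weight_read_cost / batch_size + per_token_compute

for b in (1, 8, 64):
    print(f"batch={b:3d}  cost/token={cost_per_token(b):.4f}")

# Ratio between batch sizes 1 and 64 under these toy parameters:
print(cost_per_token(1) / cost_per_token(64))  # ~16x
```

At small batch sizes the weight-streaming term dominates, which is why a single self-hosted user pays nearly the full memory-bandwidth cost per token while a provider batching dozens of requests amortizes it away.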


