
Hosted LLMs are heavily subsidised. If you self-host them and run them at cost, you find that the GPU costs alone are high, and that's before counting the additional tooling that OpenAI and Anthropic provide, which must also cost a lot to operate.


If you self-host, you likely won't have anywhere near enough volume to do efficient batching, so you end up bottlenecked on memory bandwidth rather than compute.

E.g. based on the calculations in https://www.tensoreconomics.com/p/llm-inference-economics-fr..., increasing batch size from 1 to 64 cuts the cost per token to 1/16th.
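A back-of-the-envelope sketch of why batching helps (all constants below are illustrative assumptions, not the article's measured numbers): during decode at small batch sizes the GPU is bound by streaming the weights from VRAM, and that one pass over the weights serves every sequence in the batch, so cost per token falls as the batch grows until compute becomes the limit.

    # Back-of-the-envelope decode cost vs. batch size.
    # All constants are illustrative assumptions, not the article's measured numbers.
    GPU_COST_PER_HOUR = 2.00      # assumed GPU rental price, USD/hour
    MEM_BANDWIDTH_GBPS = 2000     # assumed HBM bandwidth, GB/s
    COMPUTE_TFLOPS = 300          # assumed usable FP16 throughput
    MODEL_WEIGHT_GB = 140         # assumed ~70B model held in FP16
    FLOPS_PER_TOKEN = 140e9       # roughly 2 * params per decoded token

    def cost_per_million_tokens(batch_size):
        # Memory-bound case: one pass over the weights serves the whole batch.
        mem_time = MODEL_WEIGHT_GB / MEM_BANDWIDTH_GBPS
        # Compute-bound case: FLOPs grow with the number of sequences in the batch.
        compute_time = batch_size * FLOPS_PER_TOKEN / (COMPUTE_TFLOPS * 1e12)
        step_time = max(mem_time, compute_time)   # whichever dominates
        tokens_per_second = batch_size / step_time
        return GPU_COST_PER_HOUR / 3600 / tokens_per_second * 1e6

    for b in (1, 8, 64):
        print(f"batch {b:>2}: ${cost_per_million_tokens(b):,.2f} per 1M tokens")

This toy model scales perfectly linearly while the GPU stays memory-bound, so it overstates the gain; the article's figure of roughly 16x reflects KV-cache reads and other per-sequence overheads that grow with the batch.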


Before I started self-hosting my LLMs with Ollama, I imagined that they required a ton of energy to operate. I was amazed at how quickly my local LLM runs on a relatively inexpensive GeForce RTX 4060 with 8GB of VRAM and an 8B model. The 8B model isn't as smart as the hosted 70B models I've used, but it's still surprisingly useful.
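For a sense of why an 8B model fits comfortably on an 8GB card, here is a rough VRAM estimate. The quantization level, model shape, and context length are assumptions (a ~4-bit quant and a Llama-3-8B-like architecture), not details from the comment above:

    # Rough VRAM estimate for an 8B model on an 8 GB card.
    # Assumes ~4-bit quantization (common default for Ollama models) and a
    # Llama-3-8B-like shape; actual numbers depend on the model and settings.
    params_billion = 8
    bytes_per_param_q4 = 0.56                  # ~4.5 bits/param incl. overhead (assumed)
    weights_gb = params_billion * bytes_per_param_q4

    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16) per token.
    layers, kv_heads, head_dim = 32, 8, 128    # assumed model shape
    context_tokens = 4096
    kv_gb = 2 * layers * kv_heads * head_dim * 2 * context_tokens / 1e9

    print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB "
          f"= ~{weights_gb + kv_gb:.1f} GB of 8 GB VRAM")

By the same arithmetic, a 70B model at 4 bits needs roughly 40GB for the weights alone, which is why those models end up on multi-GPU rigs or hosted services rather than a single consumer card.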



