
Hosted LLMs are heavily subsidised. If you self-host them and run them at cost, you find that the GPU costs alone are high, and that's before counting the additional tooling that OpenAI and Anthropic provide, which must also cost a lot to operate.


If you self-host, you likely won't have anywhere near enough volume to do efficient batching, so you end up bottlenecked on memory bandwidth rather than compute.

E.g. based on the calculations in https://www.tensoreconomics.com/p/llm-inference-economics-fr..., increasing batch size from 1 to 64 cuts the cost per token to 1/16th.
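A back-of-the-envelope sketch of why batching helps (all constants below are illustrative assumptions, not the article's measured numbers): during decode at small batch sizes the GPU is bound by streaming the weights from VRAM, and that one pass over the weights serves every sequence in the batch, so cost per token falls as the batch grows until compute becomes the limit.

    # Back-of-the-envelope decode cost vs. batch size.
    # All constants are illustrative assumptions, not the article's measured numbers.
    GPU_COST_PER_HOUR = 2.00      # assumed GPU rental price, USD/hour
    MEM_BANDWIDTH_GBPS = 2000     # assumed HBM bandwidth, GB/s
    COMPUTE_TFLOPS = 300          # assumed usable FP16 throughput
    MODEL_WEIGHT_GB = 140         # assumed ~70B model held in FP16
    FLOPS_PER_TOKEN = 140e9       # roughly 2 * params per decoded token

    def cost_per_million_tokens(batch_size):
        # Memory-bound case: one pass over the weights serves the whole batch.
        mem_time = MODEL_WEIGHT_GB / MEM_BANDWIDTH_GBPS
        # Compute-bound case: FLOPs grow with the number of sequences in the batch.
        compute_time = batch_size * FLOPS_PER_TOKEN / (COMPUTE_TFLOPS * 1e12)
        step_time = max(mem_time, compute_time)   # whichever dominates
        tokens_per_second = batch_size / step_time
        return GPU_COST_PER_HOUR / 3600 / tokens_per_second * 1e6

    for b in (1, 8, 64):
        print(f"batch {b:>2}: ${cost_per_million_tokens(b):,.2f} per 1M tokens")

This toy model scales perfectly linearly while the GPU stays memory-bound, so it overstates the gain; the article's figure of roughly 16x reflects KV-cache reads and other per-sequence overheads that grow with the batch.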


Before I started self-hosting my LLMs with Ollama, I imagined that they required a ton of energy to operate. I was amazed at how quickly my local LLM runs on a relatively inexpensive GeForce RTX 4060 with 8GB of VRAM and an 8B model. The 8B model isn't as smart as the hosted 70B models I've used, but it's still surprisingly useful.
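For a sense of why an 8B model fits comfortably on an 8GB card, here is a rough VRAM estimate. The quantization level, model shape, and context length are assumptions (a ~4-bit quant and a Llama-3-8B-like architecture), not details from the comment above:

    # Rough VRAM estimate for an 8B model on an 8 GB card.
    # Assumes ~4-bit quantization (common default for Ollama models) and a
    # Llama-3-8B-like shape; actual numbers depend on the model and settings.
    params_billion = 8
    bytes_per_param_q4 = 0.56                  # ~4.5 bits/param incl. overhead (assumed)
    weights_gb = params_billion * bytes_per_param_q4

    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16) per token.
    layers, kv_heads, head_dim = 32, 8, 128    # assumed model shape
    context_tokens = 4096
    kv_gb = 2 * layers * kv_heads * head_dim * 2 * context_tokens / 1e9

    print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB "
          f"= ~{weights_gb + kv_gb:.1f} GB of 8 GB VRAM")

By the same arithmetic, a 70B model at 4 bits needs roughly 40GB for the weights alone, which is why those models end up on multi-GPU rigs or hosted services rather than a single consumer card.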



