
> but I'm not really sure about calling it "local-first" as it's still reliant on an `ANTHROPIC_API_KEY`.

See here:

https://github.com/localgpt-app/localgpt/blob/main/src%2Fage...




What reasonably comparable model can be run locally on, say, 16GB of video memory, compared to Opus 4.6? As far as I know, Kimi (while good) needs serious GPUs: an RTX 6000 Ada at minimum, more likely an H100 or H200.

Mistral's Devstral¹ family has very good models that can be run locally.

They are among the top open models, and surpass some closed ones.

I've been using Devstral, Codestral and Le Chat exclusively for three months now, all from Mistral's hosted versions: agentic work, code completion and day-to-day stuff. It's not perfect, but neither is any other model or product, so it's good enough for me. Less anecdotal are the various benchmarks that put them surprisingly high in the rankings.

¹https://mistral.ai/news/devstral


Nothing will come close to Opus 4.6 here. You will be able to fit a distilled 20B to 30B model on your GPU. gpt-oss-20b is quite good in my local testing on a MacBook Pro (M2 Pro, 32GB).
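
Back-of-the-envelope on why ~20B is about the ceiling for 16GB of VRAM (a rough sketch; real runtimes also need room for activations and the KV cache):

  # Weight memory is roughly params * bits_per_weight / 8 bytes.
  def weight_gb(params_b: float, bits: float) -> float:
      return params_b * 1e9 * bits / 8 / 1e9

  for bits in (16, 8, 4):
      print(f"20B @ {bits}-bit: ~{weight_gb(20, bits):.0f} GB")
  # 16-bit ~40 GB, 8-bit ~20 GB, 4-bit ~10 GB -> only 4-bit fits in 16 GB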

The bigger downside, compared to Opus or any other hosted model, is the limited context. You might be able to achieve around 30k tokens. Hosted models often have 128k or more; Opus 4.6 has 200k as its standard and 1M in API beta mode.
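
The memory cost here is mostly the KV cache, which grows linearly with context length. A sketch of the standard formula (the layer/head numbers below are a hypothetical 20B-class config, not any specific model's):

  # Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes.
  def kv_gb(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
      return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

  # Assumed config: 48 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
  print(kv_gb(48, 8, 128, 30_000))   # ~5.9 GB at 30k tokens
  print(kv_gb(48, 8, 128, 200_000))  # ~39 GB at 200k tokens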


There are local models with larger context, but the memory requirements explode pretty quickly, so you need to lower the parameter count or resort to heavy quantization. Some local inference platforms let you place the KV cache in system memory (while still otherwise using the GPU). Then you can just use swap to allow even very long contexts, though this slows inference down quite a bit. (The write load on the KV cache is just appending one KV vector per inferred token, so it's quite swap-friendly; you won't be wearing out the underlying storage all that much.)
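
For example, llama.cpp (here via the llama-cpp-python bindings; the model path and context size are placeholders) can keep the weights in VRAM while holding the KV cache in system memory, where the OS can page it to swap:

  from llama_cpp import Llama

  # Sketch: all weight layers on the GPU, but the KV cache stays in
  # system memory (offload_kqv=False), where it can spill to swap.
  llm = Llama(
      model_path="model-q4_k_m.gguf",  # placeholder path
      n_gpu_layers=-1,                 # offload all layers to the GPU
      n_ctx=131072,                    # large context; KV cache lives in RAM
      offload_kqv=False,               # don't put the KV cache in VRAM
  )

  out = llm("Summarize this repo's build steps:", max_tokens=256)
  print(out["choices"][0]["text"])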

I made something similar to this project and tested it against a few 3B and 8B models (Qwen and Ministral, both the instruction and the reasoning variants). I was pleasantly surprised by how fast and accurate these small models have become. I can ask it things like "check out this repo and build it", and with a Ralph strategy it will eventually succeed, despite the small context size.
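
The Ralph strategy is just a brute-force retry loop: rerun the agent from a fresh context against the same prompt until an external check passes. A minimal sketch (the agent CLI and the make-based success check are hypothetical placeholders):

  import subprocess

  PROMPT = "Check out https://example.com/repo.git and build it."

  def build_succeeds() -> bool:
      # Hypothetical external check: does the checked-out project compile?
      return subprocess.run(["make", "-C", "repo"]).returncode == 0

  # Ralph loop: same prompt, fresh agent context every iteration.
  # Each run starts clean, so a small context window is enough.
  for attempt in range(20):
      subprocess.run(["local-agent", "--prompt", PROMPT])  # placeholder CLI
      if build_succeeds():
          break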

Nothing close to Opus is available in open weights. That said, do all your tasks need the power of Opus?

The problem is that having to actively decide when to use Opus defeats much of the purpose.

You could try letting a model decide, but given my experience with at least OpenAI’s “auto” model router, I’d rather not.


I also don't like having to think about it, and if Opus were free I would not bother, even though keeping a decent local alternative working is a good defensive move regardless.

But let's face it: for most people, Opus comes at a significant financial cost per token if used more than very casually, so using it for rather trivial or iterative tasks that nevertheless consume a lot of tokens is something to avoid.
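
A quick back-of-the-envelope makes the point (the per-token prices below are illustrative assumptions, not Opus 4.6's actual rates):

  # Hypothetical pricing, for illustration only.
  INPUT_PER_MTOK = 15.0    # $/million input tokens (assumed)
  OUTPUT_PER_MTOK = 75.0   # $/million output tokens (assumed)

  # An iterative agent loop re-reads a large context on every run:
  runs, ctx_tokens, out_tokens = 50, 100_000, 2_000
  cost = runs * (ctx_tokens * INPUT_PER_MTOK + out_tokens * OUTPUT_PER_MTOK) / 1e6
  print(f"${cost:.2f}")  # $82.50 for one afternoon of trivial iteration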



